Summary

Domains such as bioinformatics, source code version control systems, collaborative
editing systems (wikis), and others produce large text collections that are highly
repetitive. That is, there are few differences between the elements of the
collection. This makes the collection extremely compressible. For example, a
collection of versions of a single Wikipedia article can be compressed to 0.1% of
its original space using the Lempel-Ziv compression scheme of 1977 (LZ77).

Many of these repetitive collections contain large volumes of text. A method is
therefore needed to store them efficiently while still operating on them. The most
common operations are extracting arbitrary portions of the collection and finding
all the occurrences of a pattern within the collection.

A self-index is a structure that stores a text in compressed form and allows the
occurrences of a pattern to be found efficiently. In addition, self-indexes allow
any portion of the collection to be extracted. One goal of these indexes is that
they can be stored in main memory. This feature is very important, since disk can
be up to a million times slower than main memory.

Most existing self-indexes are based on a compression scheme that predicts the next
symbols from a fixed number of preceding symbols. This scheme, however, does not
work on repetitive texts, as it cannot recognize all the repeated elements in the
collection. A scheme that does capture the repetitions is LZ77, but it has the
drawback of not supporting random access to the text.

This work presents an algorithm for extracting substrings from a text compressed
with a Lempel-Ziv scheme. It also presents LZ-End, a variant of LZ77 that allows
the text to be extracted efficiently using space close to that of LZ77. LZ77
extracts on the order of 1 million characters per second, while LZ-End extracts
more than twice as many.

Our most important result is the development of the first self-index oriented to
repetitive texts and based on LZ77/LZ-End. Its performance surpasses that of the
RLCSA self-index, the state of the art for repetitive texts. The compression of our
indexes is up to two times better than that of the RLCSA on DNA and Wikipedia
collections. Notably, our LZ77-based index is built in 35% of the time required by
the RLCSA, using 60% of the construction space. Searching for short patterns is
faster than on the RLCSA, and for long patterns the space/time trade-off favors our
indexes.

Finally, we also present a collection of repetitive texts from various domains.
This collection is publicly available, with the aim that it may become a reference
for experimentation.

Abstract



Domains like bioinformatics, version control systems, collaborative editing systems
(wikis), and others produce huge data collections that are very repetitive. That
is, there are few differences between the elements of the collection. This fact makes
the compressibility of the collection extremely high. For example, a collection with
all the different versions of a Wikipedia article can be compressed to 0.1% of its
original space using the Lempel-Ziv 1977 (LZ77) compression scheme.

Many of these repetitive collections handle huge amounts of text data. For that 
reason, we require a method to store them efficiently, while providing the ability to 
operate on them. The most common operations are the extraction of random portions 
of the collection and the search for all the occurrences of a given pattern inside the 
whole collection. 

A self-index is a data structure that stores a text in compressed form and allows
finding the occurrences of a pattern efficiently. In addition, self-indexes can
extract any substring of the collection, hence they are able to replace the original
text. One of the main goals when using these indexes is to store them within main
memory. This characteristic is very important, as the disk may be 1 million times
slower than main memory.

Most current self-indexes are based on a compression scheme that predicts the
following symbol based on the previous k symbols. However, this scheme is not well
suited for repetitive texts, as it does not capture long-range repetitions. The LZ77
compression scheme does capture such repetitions, but it does not support random
access to the text.

In this thesis we present a scheme for random text extraction from text compressed
with a Lempel-Ziv parsing. Additionally, we present a variant of LZ77, called
LZ-End, that efficiently extracts text using space close to that of LZ77. LZ77 extracts
around 1 million characters per second, while LZ-End extracts over 2 million.

The main contribution of this thesis is the first self-index based on LZ77/LZ-End 
and oriented to repetitive texts, which outperforms the state of the art (the RLCSA 
self-index) in many aspects. The compression of our indexes is better than that of 
RLCSA, being two times better for DNA and for Wikipedia articles. Our index is 
built using just 60% of the space required by the RLCSA and within 35% of the time. 
Searching for short patterns is faster than on the RLCSA, and for longer patterns the 
space/time trade-off is in favor of our indexes. 

Finally, we present a corpus of repetitive texts, coming from several application
domains. We aim at providing a standard set of texts for research and
experimentation, hence this corpus is publicly available.

Chapter 1
Introduction 



In recent times we have seen a rise in the amount of digital information. This may
be attributable to the drop in data acquisition and storage costs. Most of this
information is text, that is, symbol sequences representing natural language, music,
source code, time series, biological sequences like DNA and proteins, and others.

Although the examples presented above seem very different, there is an operation
that arises in most applications handling those types of sequences. This operation
is called text search and consists in finding all positions in the text where a given
pattern appears. This operation serves as a basis for building more complex and
meaningful operations, like finding the most common words, or finding approximate
patterns.

Text search can be solved by two different approaches. The first scans the text 
sequentially looking for matches of the pattern. Classical examples of this type of
search are the Knuth-Morris-Pratt [KMP77] and Boyer-Moore [BM77] algorithms. The
second way of searching is by querying an index of the text, a data structure we 
have to build before performing the queries. This structure allows us to find the 
occurrences of a given pattern without scanning the whole text. 

To index the text we need enough space to store the index and, most importantly,
we need to be able to access it efficiently. Nowadays storage is not a difficult
problem; efficient access is. In recent years the speed of hard drives has not
improved significantly. Hard-drive access times are around 10 ms = $10^7$ ns, while
main memory (RAM) access is around 10 ns; in other words, accessing secondary
storage is 1 million times slower than accessing main memory. This problem persists
despite the appearance of solid state drives (SSD), which have access times around
0.1 ms = $10^5$ ns, that is, 10 thousand times slower than main memory. For this
reason, indexes using space proportional to the compressed text have been proposed,
aiming at storing them in main memory and handling the data directly in compressed
form, rather than decompressing before using it [ZdMNBY00, NM07]. There are some
indexes that, within that compressed space, are able to replace the original text;
these are called self-indexes and are obviously preferable, as one can discard the
original text.

A particular kind of text that does not yet fully benefit from current self-indexes
is the repetitive one. Repetitive texts arise from domains that handle huge
collections of very similar entries or documents. For example, in a DNA collection
of human genomes of different individuals, the similarity between any two DNA
sequences would be close to 99.9% [B+08]. Source code collections are also very
repetitive, as the changes between one version and the next are not substantial,
except in the case of a major release. Versioning systems, like wikis, also generate
very repetitive collections because each revision is very similar to the previous
one. The main problem is that existing self-indexes do not sufficiently exploit
these repetitions, so the self-index is orders of magnitude larger than the space
achievable with a compression scheme that does exploit the repetitions, like LZ77
[ZL77]. LZ77 parses the text into phrases so that each phrase, except its last
letter, appears previously in the text (these previous occurrences are called
sources). It compresses by essentially replacing each phrase by a backward pointer.
A recent work aiming at adapting current self-indexes to handle large DNA databases
of the same species [SVMN08] found that LZ77 compression was still much superior at
capturing this repetitiveness, yet it was inadequate as a format for compressed
storage because of its inability to retrieve individual sequences from the
collection. Another work [CN09, CFMPN10] shows that grammar-based compression can
allow extraction of substrings while capturing such repetitions, yet LZ77
compression is superior to grammar compression [Ryt03, CLL+05].

For these reasons in this thesis we focus on the definition of a self-index oriented to
repetitive texts and based on LZ77-like compression schemes. Our main contributions
are two: (1) a scheme for random text extraction in LZ77-like parsing, as well as a
space-competitive variant, called LZ-End, achieving faster text extraction; (2) a
self-index based on LZ77/LZ-End that achieves a better space/time trade-off than the
best self-indexes oriented to repetitive texts.

1.1 Contributions of the Thesis 

Chapter 3 : We create a public corpus of highly repetitive texts. The corpus
is composed of texts coming from different real domains like biology, source
code repositories, document repositories, and others, as well as artificial texts
having interesting combinatorial properties. This corpus is available at
http://pizzachili.dcc.uchile.cl/repcorpus.html.

Chapter 4 : The worst-case extraction time of a substring of length m in an LZ77
parsing is O(mH), where H is the maximum number of times a character is
transitively copied in the parsing. We present an alternative parsing, called
LZ-End, that performs very close to LZ77 in terms of compression but permits
faster text extraction, O(m + H) worst-case time. This work was published in
the 20th Data Compression Conference [KN10].

Chapter 5 : We introduce a new self-index oriented to repetitive texts and based
on the LZ77, LZ-End, and similar parsings. Let n' be the number of phrases
of the parsing (for highly repetitive texts, n' will be a small value). This index
uses in theory $2n' \log n + n' \log n' + n' \log D + O(n' \log \sigma) + o(n)$ bits
of space, where $\sigma$ is the size of the alphabet and D is upper-bounded by the
maximum number of sources covering each other. It finds the occ occurrences of a
pattern of length m in time $O(m^2 H + m \log n' + occ \cdot D \log n')$. We present
several practical variants that achieve better results, both in time and space, than
the Run-length Compressed Suffix Array (RLCSA) [SVMN08] and the Grammar-based
Self-index [CN09, CFMPN10], the state-of-the-art self-indexes oriented
to repetitive texts.

1.2 Outline of the Thesis 

Chapter 2 describes basic concepts and related work relevant to this thesis. 

Chapter 3 presents a text corpus intended for repetitive text. 

Chapter 4 explains the Lempel-Ziv (LZ) parsing and some of its properties. It 
also introduces a new LZ variant called LZ-End, able to extract an arbitrary 
substring in constant time per extracted symbol in some cases. 

Chapter 5 presents a new self-index based on LZ77-like parsings. It covers the 
theoretical proposal and the considerations we made when implementing the 
index. 

Chapter 6 shows the experimental results of our proposed index, comparing it with 
the state-of-the-art self-indexes for repetitive texts. 






Chapter 7 presents our conclusions and gives some lines of research that can be
further investigated.

In this chapter we introduce the basic concepts and notation used throughout this
thesis. Then, we present the data structures used to build our index. Finally, we
present two self-indexes oriented to repetitive texts. All logarithms in this thesis
are in base 2, and we assume that log 0 = 0.

2.1 Strings 

Definition 2.1. A string T is a sequence of characters drawn from an alphabet $\Sigma$.
The alphabet is an ordered and finite set of size $|\Sigma| = \sigma$. The i-th character
of a string is represented as $T[i]$. The symbol $\varepsilon$ represents the empty string
of length 0.

Definition 2.2. Given a string T, and positions i and j, the substring of T starting at
i and ending at j is defined as $T[i,j] = T[i]T[i+1] \ldots T[j]$. If $i > j$, then
$T[i,j] = \varepsilon$.

Definition 2.3. Let T be a string of length n. The prefixes of T are the strings
$T[1,j]$, $\forall\, 0 \le j \le n$, and its suffixes are the strings $T[i,n]$,
$\forall\, 1 \le i \le n+1$.

Definition 2.4. Let $T_1$, $T_2$ be strings of length $n_1$ and $n_2$, respectively. We
define the concatenation of these strings as
$T_1T_2 = T_1[1] \ldots T_1[n_1]\,T_2[1] \ldots T_2[n_2]$.

Definition 2.5. Given a string T of length n, the reverse of T is
$T^{rev} = T[n]T[n-1] \ldots T[2]T[1]$.

Definition 2.6. The lexicographic order (<) between strings is defined as follows:
Let a, b be characters in $\Sigma$ and X, Y be strings over $\Sigma$.

$\varepsilon < X, \ \forall X \neq \varepsilon$
$aX < bY$ if $a < b \ \lor\ (a = b \land X < Y)$

2.2 Search Problems

Definition 2.7. Given a string T and a pattern P (a string of length m), both over an
alphabet $\Sigma$, the occurrence positions of P in T are defined as
$O = \{1 + |X| : \exists X, Y,\ T = XPY\}$.

Definition 2.8. Given a string T and a pattern P, the following search problems are 
of interest: 

• exists(P, T) returns true iff P is in T, i.e., returns true iff $O \neq \emptyset$.

• count(P, T) counts the number of occurrences of P in T, i.e., returns $occ = |O|$.

• locate(P, T) finds the occurrences of P in T, i.e., returns the set O in some
order.

• extract(T, l, r) extracts the text substring $T[l,r]$.

Remark 2.9. Note that exists and count can be answered after performing a locate 
query. 
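
As a point of reference, the four queries of Definition 2.8 can be sketched over an uncompressed string by brute force. This is only an illustrative baseline (not the indexed solution developed in this thesis); the function names mirror the definition and positions are 1-based, as in Definition 2.7.

```python
def locate(pattern, text):
    """Return the sorted 1-based starting positions of pattern in text (the set O)."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def count(pattern, text):
    """Number of occurrences, occ = |O|."""
    return len(locate(pattern, text))


def exists(pattern, text):
    """True iff O is not empty."""
    return count(pattern, text) > 0


def extract(text, l, r):
    """Return T[l, r] with 1-based inclusive positions; empty string if l > r."""
    return text[l - 1:r] if l <= r else ""
```

As Remark 2.9 notes, `count` and `exists` are derived here from `locate`; a real index would usually answer them faster than a full `locate`.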



2.3 Entropy 

Definition 2.10. Let T be a string of length n. The zero-th order empirical entropy
is defined as

$$H_0(T) = \sum_{c \in \Sigma} \frac{n_c}{n} \log \frac{n}{n_c}$$

where $n_c$ is the number of times the character c appears in T, that is, $n_c/n$ is the
empirical probability of appearance of character c.

It is worth noticing that the zero-th order entropy is invariant to permutations in
the order of the text characters. The value $nH_0(T)$ is the least number of bits needed
to represent T using a compressor that gives each character a fixed encoding.

Definition 2.11. Let T be a string of length n. The k-th order empirical entropy
[Man01] is defined as

$$H_k(T) = \frac{1}{n} \sum_{S \in \Sigma^k} |T^S| \, H_0(T^S)$$

where $T^S$ is the sequence composed of all characters preceded by string S in T.

The value $nH_k(T)$ is the least number of bits needed to represent T using a
compressor that encodes each character taking into account the k preceding characters
in T. This value assumes the first k characters are encoded for free, thus it gives a
relevant lower bound only when $n \gg k$.

$H_k$ is a non-increasing function of k, that is,

$$0 \le H_k(T) \le H_{k-1}(T) \le \ldots \le H_1(T) \le H_0(T) \le \log \sigma.$$
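
The two entropy definitions can be checked numerically with a short sketch. The helper names `h0` and `hk` are illustrative assumptions; `hk` groups the characters of T by their preceding k-symbol context S, forming the sequences $T^S$ of Definition 2.11, and treats the first k characters as free.

```python
from collections import Counter, defaultdict
from math import log2


def h0(seq):
    """Zero-th order empirical entropy: sum over c of (n_c / n) * log(n / n_c)."""
    n = len(seq)
    if n == 0:
        return 0.0
    return sum((c / n) * log2(n / c) for c in Counter(seq).values())


def hk(text, k):
    """k-th order empirical entropy: (1 / n) * sum over contexts S of |T^S| * H0(T^S)."""
    if k == 0:
        return h0(text)
    n = len(text)
    followers = defaultdict(list)  # context S -> characters preceded by S (T^S)
    for i in range(k, n):
        followers[text[i - k:i]].append(text[i])
    return sum(len(ts) * h0(ts) for ts in followers.values()) / n
```

For instance, `h0("abab")` is 1 bit per symbol, while `hk("abab", 1)` is 0 because each character is fully determined by its predecessor.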



The following lemma lays the groundwork to show that the empirical entropy is
not a good lower-bound measure for the compressibility of repetitive texts.

Lemma 2.12. Let T be a string of length n. For any $k \le n$ it holds $H_k(TT) \ge H_k(T)$.



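Proof. As new relevant contexts may have arisen in the concatenation TT, we denote
by $C(T,k)$ the contexts of length k present in T, and $C(TT,k)$ the contexts of TT.
We have that $C(T,k) \subseteq C(TT,k)$. The number of new contexts in TT is at most k.
For each $S \in C(T,k)$, we have $(TT)^S = T^S A^S T^S$, for some $A^S$ such that
$|A^S| \le k$. Then,

$$
\begin{aligned}
H_k(TT) &= \frac{1}{|TT|} \sum_{S \in C(TT,k)} |(TT)^S| \, H_0((TT)^S) \\
        &\ge \frac{1}{2|T|} \sum_{S \in C(T,k)} |T^S A^S T^S| \, H_0(T^S A^S T^S) \\
        &\ge \frac{1}{2|T|} \sum_{S \in C(T,k)} |T^S T^S| \, H_0(T^S T^S) \\
        &= \frac{1}{|T|} \sum_{S \in C(T,k)} |T^S| \, H_0(T^S) \\
        &= H_k(T).
\end{aligned}
$$

In the first step we used $C(T,k) \subseteq C(TT,k)$; in the second we used
$|T| H_0(T) \le |TA| H_0(TA)$, for $T = T^S T^S$ and $A = A^S$ (since
$|T^S A^S T^S| H_0(T^S A^S T^S) = |T^S T^S A^S| H_0(T^S T^S A^S)$); and in the third
we used $H_0(TT) = H_0(T)$. The second property holds because

$$
\begin{aligned}
|TA| \, H_0(TA) &= \sum_{c \in \Sigma} (n_c^T + n_c^A) \log \frac{n^T + n^A}{n_c^T + n_c^A} \\
                &\ge \sum_{c \in \Sigma} n_c^T \log \frac{n^T + n^A}{n_c^T + n_c^A} \\
                &\ge \sum_{c \in \Sigma} n_c^T \log \frac{n^T}{n_c^T} \\
                &= |T| \, H_0(T)
\end{aligned}
$$

where $n_c^X$ is the number of occurrences of character c in string X, and $n^X = |X|$ for
X = T or A. The last inequality is justified by the Gibbs inequality [Ham86]. □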

It follows that $|TT| \, H_k(TT) \ge 2|T| \, H_k(T)$, that is, to encode TT this model uses
at least twice the space it uses to encode T. An LZ77 encoding would need
just one more phrase, as seen later.
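
The contrast with LZ77 can be illustrated with a naive parser that follows the informal description from the introduction: each phrase is the longest string that already starts earlier in the text, plus one new letter. The sketch below is a quadratic-time illustration under that assumption, not the parser used later in this thesis; the names `lz77_parse` and `lz77_decode` are hypothetical.

```python
def lz77_parse(text):
    """Greedy LZ77-like parse into triples (source, length, last_char).

    Each phrase copies `length` characters starting at an earlier position
    `source` (sources may overlap the phrase) and then adds one explicit character.
    """
    phrases = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        for src in range(i):
            length = 0
            # Reserve the final character of the text for the explicit letter.
            while i + length < n - 1 and text[src + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the text by copying from the already decoded prefix."""
    out = []
    for src, length, ch in phrases:
        for k in range(length):
            out.append(out[src + k])
        out.append(ch)
    return "".join(out)
```

On `"abcd"` this produces 4 phrases and on `"abcdabcd"` it produces 5: the whole second copy is covered by a single phrase, whereas the entropy bound above at least doubles.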



2.4 Encodings 

Most data structures need to represent symbols and numbers. Classic data structures 
use a fixed amount of space to store them, for example 1 byte for characters and 4 
bytes for integers. Instead, compressed data structures aim to use the minimum 
possible space, thus they represent symbols using variable-length prefix-free codes or 
just using a fixed amount b of bits, where b is as small as possible. Table 2.1 shows
different encodings for the integers 1, ..., 9, which we describe next.

Unary Codes This representation is the simplest and serves as a basis for other
coders. It represents a positive n as $1^{n-1}0$, thus it uses exactly n bits.

Gamma Codes It represents a positive n by concatenating the length of its binary
representation in unary and the binary representation of the symbol, omitting
the most significant bit. The space is $2\lfloor \log n \rfloor + 1$ bits:
$\lfloor \log n \rfloor + 1$ for the length and $\lfloor \log n \rfloor$ for the binary
representation.

Delta Codes This is an extension of $\gamma$-codes that works better on larger numbers.
It represents the length of the binary representation of n using $\gamma$-codes and then
n in binary without its most significant bit, thus using
$2\lfloor \log(\lfloor \log n \rfloor + 1) \rfloor + 1 + \lfloor \log n \rfloor$ bits.

Table 2.1: Example of different coders



Vbyte Coding [WZ99] It splits the $\lceil \log(n+1) \rceil$ bits needed to represent n into
blocks of b bits and stores each block in a chunk of b + 1 bits. The highest bit
is 0 in the chunk holding the most significant bits of n, and 1 in the rest of
the chunks. For clarity we write the chunks from most to least significant, just
like the binary representation of n. For example, if n = 25 = 11001 and b = 3,
then we need two chunks and the representation is 0011 · 1001. Compared to
an optimal encoding of $\lceil \log(n+1) \rceil$ bits, this code loses one bit per b bits
of n, plus possibly an almost empty final chunk. Even when the best choice for b is
used, the total space achieved is still worse than that of $\delta$-encoding. In
exchange, Vbyte codes are very fast to decode.
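
The four coders can be sketched as small functions that return bit strings, which makes the code lengths stated above easy to verify. This is a didactic sketch (a practical coder would pack bits into machine words), and the function names are illustrative.

```python
def binary(n):
    """Binary representation of a positive integer, most significant bit first."""
    return bin(n)[2:]


def unary(n):
    """Unary code: n - 1 ones followed by a zero (n bits)."""
    return "1" * (n - 1) + "0"


def gamma(n):
    """Gamma code: length of binary(n) in unary, then binary(n) without its top bit."""
    b = binary(n)
    return unary(len(b)) + b[1:]


def delta(n):
    """Delta code: length of binary(n) in gamma, then binary(n) without its top bit."""
    b = binary(n)
    return gamma(len(b)) + b[1:]


def vbyte(n, b):
    """Vbyte code with b-bit blocks, chunks written from most to least significant.

    The flag bit is 0 in the chunk holding the most significant bits, 1 elsewhere.
    """
    bits = binary(n)
    width = -(-len(bits) // b) * b          # round the length up to a multiple of b
    bits = bits.zfill(width)
    blocks = [bits[i:i + b] for i in range(0, width, b)]
    return " ".join(("0" if i == 0 else "1") + blk for i, blk in enumerate(blocks))
```

For example, `vbyte(25, 3)` yields the two chunks `0011 1001`, matching the example in the text, and `gamma(5)` yields `11001`, which has $2\lfloor \log 5 \rfloor + 1 = 5$ bits.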



2.4.1 Directly Addressable Codes 

In many cases we need to store a set of numbers using the least possible space,
yet providing fast random access to each element. Variable-length codes complicate
this task, as they require storing, in addition, pointers to sampled positions of the
encoded sequence.

A simple solution that shows good performance in practice is the so-called Directly
Addressable Codes (DAC) [BLN09], a variant of Vbytes [WZ99]. They start with
a sequence $C = C_1, \ldots, C_n$ of n integers. Then they compute the Vbyte encoding
of each number. The least significant blocks are stored contiguously in an array $A_1$,
and the highest bits of the least significant chunks are stored in a bitmap $B_1$. The
remaining chunks are organized in the same way in arrays $A_i$ and bitmaps $B_i$, storing
contiguously the i-th chunks of the numbers that have them. Note that arrays $A_i$
store contiguously the bits $(i-1) \cdot b + 1, \ldots, i \cdot b$ and bitmaps $B_i$ store
whether a number has further chunks or not, hence the name Reordered Vbytes.

Figure 2.1 shows an example of the resulting structure. The first element is
represented with two blocks, thus $A_1[0] = C_{1,1}$, $A_2[0] = C_{1,2}$, $B_1[0] = 1$ and
$B_2[0] = 0$.



C:   C_{1,1} C_{1,2} | C_{2,1} | C_{3,1} C_{3,2} C_{3,3} | C_{4,1} C_{4,2} | C_{5,1}

A_1: C_{1,1} C_{2,1} C_{3,1} C_{4,1} C_{5,1}
B_1: 1       0       1       1       0

A_2: C_{1,2} C_{3,2} C_{4,2}
B_2: 0       1       0

A_3: C_{3,3}

Figure 2.1: Example of Directly Addressable Codes structure

To access the element at position $i = i_1$ we check whether $B_1[i_1]$ is set. If it is
not set, this is the last chunk and we already have the value $C[i] = A_1[i_1]$; otherwise
we have to fetch the following chunks. In that case, we recompute the position as
$i_2 = rank_1(B_1, i_1)$, where $rank_1(B_1, i_1)$ is the number of ones up to position
$i_1$ on bitmap $B_1$ (see Section 2.5 for further details). If $B_2[i_2]$ is not set we
are done with $C[i] = A_1[i_1] + A_2[i_2] \cdot 2^b$; otherwise we set
$i_3 = rank_1(B_2, i_2)$ and continue in the following levels. Accessing a random element
takes $O(\log(M)/b)$ worst-case time, where $M = \max C_j$. However, the access time is
lower for elements with shorter codewords, which are usually the most frequent ones.
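
The level-by-level layout and the access procedure can be sketched as follows. This is a minimal illustration, not the implementation used in this thesis: the class name `DAC` is hypothetical, arrays are 0-based Python lists, and $rank_1$ is computed by a linear scan rather than in constant time, so with 0-based positions the next index is $rank_1(B_j, i) - 1$.

```python
class DAC:
    """Directly Addressable Codes over non-negative integers, with b-bit chunks."""

    def __init__(self, values, b):
        self.b = b
        self.A = []  # A[j]: the (j+1)-th chunks of the numbers that have them
        self.B = []  # B[j][i] = 1 iff that number has further chunks
        mask = (1 << b) - 1
        current = list(values)
        while current:
            self.A.append([v & mask for v in current])
            self.B.append([1 if (v >> b) > 0 else 0 for v in current])
            current = [v >> b for v in current if (v >> b) > 0]

    def access(self, i):
        """Return the i-th value (0-based) by walking down the levels."""
        value, shift, level = 0, 0, 0
        while True:
            value |= self.A[level][i] << shift
            if not self.B[level][i]:
                return value
            # rank_1(B, i) - 1: position of this number in the next level.
            i = sum(self.B[level][:i + 1]) - 1
            shift += self.b
            level += 1
```

With `DAC([25, 3, 300, 64, 0], b=3)`, the value 300 needs three chunks (4, 5, 4 from least to most significant) while 3 and 0 are resolved at the first level, which reflects why short codewords are accessed faster.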

We will use the implementation of Susana Ladra^1 (available by personal request)
in this thesis.



2.5 Bitmaps 

Let B be a binary sequence over $\Sigma = \{0, 1\}$ (a bitmap) of length n and assume it has
m ones. We are interested in solving the following operations:

^1 Universidade da Coruña, Spain. sladra@udc.es

$access(B, 20) = 1$
$rank_1(B, 20) = 11$
$rank_0(B, 20) = 9$
$select_0(B, 9) = 19$
$select_1(B, 11) = 20$

Figure 2.2: Example of rank and select



Variant   Size                          Rank                             Select
Clark     n + o(n)                      O(1)                             O(1)
RRR       nH_0(B) + o(n)                O(1)                             O(1)
esp       nH_0(B) + o(n)                O(1)                             O(1)
recrank   1.44 m log(n/m) + m + o(n)    O(log(n/m))                      O(log(n/m))
vcode     m log(n / log^2 n) + o(n)     O(log^2 n)                       O(log n)
sdarray   m log(n/m) + 2m + o(m)        O(log(n/m) + log^4 m / log n)    O(log^4 m / log n)

Table 2.2: Complexities for binary rank and select



• $rank_b(B, i)$: How many b's are up to position i (included).

• $select_b(B, i)$: The position of the i-th b bit.

Example 2.13. Figure 2.2 shows an example of the operations rank and select. We
show the values of both $rank_1(B, 20) = 11$ and $rank_0(B, 20) = 9$. Note that these
two values add up to 20, since the former returns the number of ones up to position
20, and the latter the number of zeroes. Also, access simply returns the bit stored
at that position; in our case at position 20 there is a 1. Finally, we show the value
of $select_1(B, 11) = 20$, which was expected since $access(B, 20) = 1$. The value of
$select_0(B, 9)$ is 19.



Several solutions have been proposed to address this problem. The first solution
able to solve both kinds of queries in constant time uses
$n + O(\frac{n \log\log n}{\log n})$ bits of space [Cla96]. Raman, Raman and Rao's
solution (RRR) [RRR02] achieves $nH_0(B) + O(\frac{n \log\log n}{\log n})$ bits and
answers the queries in constant time. Okanohara and Sadakane
[OS07] proposed several alternatives tailored to the case of small m (sparse bitmaps):
esp, recrank, vcode, and sdarray. Table 2.2 shows the time and space complexities
of these solutions. Note that the reported spaces include the representation of the
bitmap.

2.5.1 Practical Dense Bitmaps

The extra o(n) space of theoretical solutions [Cla96] is large in practice. González
et al. [GGMN05] proposed a solution with good results in practice and small space
overhead (up to 5%). This implementation is very simple, yet its practical
performance is better than that of classical solutions. They store the plain bitmap in
an array B and have a table $R_s$ where they store $rank_1(B, i \cdot s)$, where
$s = 32k$ and k is a parameter for the frequency of the sampling of the bit vector.
They use a function called popcount that counts the number of 1 bits in a word
(4 bytes). This operation can be solved bit by bit, but it is easy to improve it, using
either bit parallelism or precomputed tables, thus requiring just a few operations.
They solve the operations as follows ($rank_0$ and $select_0$ are obvious variations):

• $rank_1(B, i)$: They start in the last entry of $R_s$ that precedes i
($R_s[\lfloor i/s \rfloor]$), and then sequentially scan the array B, popcounting
consecutive words, until reaching the desired position. The popcounting of the last
word is done by first setting all bits after position i to zero, which is done in
constant time using a mask. Thus the time is O(k).

• $select_1(B, i)$: They first binary search the $R_s$ table for the last position p where
$R_s[p] < i$. Then they scan B sequentially using popcount looking for the word
where the desired select position is. Finally they find the desired position in
the word by sequentially scanning the word bit by bit. Thus the time is
$O(k + \log\frac{n}{k})$.
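
The sampled rank table and the two query procedures can be sketched as below. This is an illustration of the scheme just described, not the libcds code: the class name `DenseBitmap`, the bit order inside each word and the use of `bin(w).count("1")` as popcount are assumptions made for the example. Positions are 1-based and $rank_1(B, i)$ counts the ones in $B[1, i]$.

```python
from bisect import bisect_left

W = 32  # bits per word


def popcount(word):
    """Number of 1 bits in a word (a real implementation would use bit tricks)."""
    return bin(word).count("1")


class DenseBitmap:
    def __init__(self, bits, k=4):
        self.k = k  # one sample every k words, i.e. s = 32k bits
        self.words = []
        for start in range(0, len(bits), W):
            word = 0
            for offset, bit in enumerate(bits[start:start + W]):
                word |= bit << offset
            self.words.append(word)
        # samples[j] = number of ones in the first j * k words
        self.samples = [0]
        for j in range(0, len(self.words), k):
            ones = sum(popcount(w) for w in self.words[j:j + k])
            self.samples.append(self.samples[-1] + ones)

    def rank1(self, i):
        """Ones in positions 1..i: start at a sample, popcount at most k words."""
        full, rem = divmod(i, W)
        block = full // self.k
        r = self.samples[block]
        for w in self.words[block * self.k:full]:
            r += popcount(w)
        if rem:
            r += popcount(self.words[full] & ((1 << rem) - 1))  # mask later bits
        return r

    def select1(self, j):
        """Position of the j-th one: binary search samples, then scan words and bits."""
        if j < 1 or j > self.samples[-1]:
            raise ValueError("j out of range")
        p = bisect_left(self.samples, j) - 1  # last sample with samples[p] < j
        r, w = self.samples[p], p * self.k
        while r + popcount(self.words[w]) < j:
            r += popcount(self.words[w])
            w += 1
        for offset in range(W):
            if (self.words[w] >> offset) & 1:
                r += 1
                if r == j:
                    return w * W + offset + 1
```

Larger values of `k` shrink the sample table at the price of longer scans, which is the space/time trade-off controlled by the sampling parameter.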

We will use the implementation of Rodrigo González (available at
http://code.google.com/p/libcds) in this thesis.

2.5.2 Practical Sparse Bitmaps 

When the bitmap is very sparse (i.e., the number of ones in the bitmap is very
low) one practical solution is to $\delta$-encode the distances between consecutive ones.
Additionally we need to store absolute sample values $select_1(B, i \cdot s)$ for a sampling
step s, plus pointers to the corresponding positions in the $\delta$-encoded sequence. We
solve the operations as follows:

• $select_1(B, i)$ is solved within O(s) time by going to the last sampled position
preceding i and decoding the $\delta$-encoded sequence from there.

• $rank_1(B, i)$ is solved in time $O(s + \log\frac{m}{s})$. First, we binary search the
samples looking for the last sampled position such that $select_1(B, j \cdot s) \le i$.
Starting from that position we sequentially decode the bitmap and stop as soon as
$select_1(B, p) \ge i$.

• $access(B, i)$ is solved in time $O(s + \log\frac{m}{s})$ in a way similar to rank.

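The space needed by the structure is
$W + \frac{m}{s}(\lfloor \log n \rfloor + 1 + \lfloor \log W \rfloor + 1)$, where
W is the number of bits needed to represent all the $\delta$-codes. In the worst case
$W = 2m\lfloor \log(\lfloor \log\frac{n}{m} \rfloor + 1) \rfloor + m\lfloor \log\frac{n}{m} \rfloor + m = m\log\frac{n}{m} + O(m \log\log\frac{n}{m})$.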

This structure allows a space-time trade-off related to s and also has the property
that several operations cost O(1) after solving others. For example, $select_1(B, p)$ and
$select_1(B, p+1)$ cost O(1) after solving $p = rank_1(B, i)$.
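
A gap-based sparse bitmap with sampling can be sketched as follows. To keep the example short, the gaps are kept as plain Python integers instead of an actual $\delta$-coded bit stream, so the pointers into the encoded sequence become list indexes; this simplification, the class name `SparseBitmap` and its methods are assumptions of the sketch. Positions are 1-based.

```python
from bisect import bisect_right


class SparseBitmap:
    def __init__(self, ones, s=4):
        """ones: sorted 1-based positions of the set bits; s: sampling step."""
        self.s = s
        self.m = len(ones)
        # gaps[j]: distance from the previous one (stand-in for the delta codes)
        self.gaps = [p - q for p, q in zip(ones, [0] + ones[:-1])]
        # samples[t] = select_1(B, t * s), with samples[0] = 0
        self.samples = [0] + [ones[j - 1] for j in range(s, self.m + 1, s)]

    def select1(self, i):
        """Position of the i-th one: jump to a sample, then decode fewer than s gaps."""
        t = i // self.s
        pos = self.samples[t]
        for j in range(t * self.s, i):
            pos += self.gaps[j]
        return pos

    def rank1(self, i):
        """Ones in positions 1..i: binary search the samples, then decode gaps."""
        t = bisect_right(self.samples, i) - 1  # last sample with value <= i
        pos, j = self.samples[t], t * self.s
        while j < self.m and pos + self.gaps[j] <= i:
            pos += self.gaps[j]
            j += 1
        return j

    def access(self, i):
        """Bit at position i, resolved through rank and select."""
        r = self.rank1(i)
        return 1 if r > 0 and self.select1(r) == i else 0
```

In this sketch `access` calls `select1` again after `rank1`; an implementation that keeps the decoding state could answer that second query in O(1), as noted above.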

2.6 Wavelet Trees 

A wavelet tree [GGV03] is an elegant data structure that stores a sequence S of n
symbols from an alphabet $\Sigma$ of size $\sigma$. This structure supports some basic
queries and is easily extensible to support others.

We split the alphabet into two halves L and R, so that the elements of L are
lexicographically smaller than those of R. Then, we create a bitmap B of size n
setting B[i] = 0 if the symbol at position i belongs to L and B[i] = 1 otherwise. This
bitmap is stored at the root of the tree. Afterward, we extract from S all symbols
belonging to L, generating sequence $S_L$, and all symbols belonging to R, generating
sequence $S_R$ (these sequences are not stored). Finally, we recursively generate the
left subtree on $S_L$ and the right subtree on $S_R$. We continue until we get a sequence
over a one-letter alphabet. Figure 2.3 shows the wavelet tree for the example text
alabar_a_la_alabarda. Only the bitmaps (black color) are stored in the tree. The
labels of the tree (gray color) show the subsets L and R, and the strings over the
bitmaps (gray color) show the conceptual subsequences $S_L$ and $S_R$.

The resulting tree has $\sigma$ leaves, height $\lceil \log \sigma \rceil$, and n bits per
level. Thus the space occupancy is $n \log \sigma$ bits, plus $o(n \log \sigma)$ (more
precisely, $O(\frac{n \log\sigma \, \log\log n}{\log n})$) additional
bits to support rank and select queries on the bitmaps.

In the following we explain how this structure supports the operations access,
rank and select on S. The last two operations are just a generalization for larger
alphabets of those defined in Section 2.5.

{_, a, b, d, l, r}:   alabar_a_la_alabarda   01000100010001000110
  {_, a, b}:          aaba_a_a_aabaa         00100000000100
    {_, a}:           aaa_a_a_aaaa           111010101111
  {d, l, r}:          lrllrd                 010010
    {d, l}:           llld                   1110

Figure 2.3: Example of a wavelet tree for the text alabar_a_la_alabarda

• Access: To retrieve the symbol S[i] we look at B[i] at the root. If it is a 0
we go to the left subtree, otherwise to the right subtree. The new position is
$i \leftarrow rank_0(B, i)$ if we go to the left and $i \leftarrow rank_1(B, i)$ if we go
to the right. This procedure continues recursively until we reach a leaf. The bits
read in the path from the root to the leaf represent the symbol sought.

• Rank: To count how many c's are up to position i we go to the left if c is in
L and otherwise to the right. The new position is $i \leftarrow rank_0(B, i)$ if we go to
the left and $i \leftarrow rank_1(B, i)$ if we go to the right, where B is the bitmap of
the root. When we reach a leaf the answer is i.

• Select: To find the i-th symbol c we first go to the leaf corresponding to c and
then go upwards to the root. Let B be the bitmap of the parent. If the current
node is a left child then the position at the parent is $i \leftarrow select_0(B, i)$,
otherwise it is $i \leftarrow select_1(B, i)$. When we reach the root the answer is the
current i value.

The running time of these operations is $O(\log \sigma)$, since we use a bitmap supporting
constant-time rank, select and access.
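
The three traversals can be sketched with a pointer-based wavelet tree. This is a didactic sketch under stated assumptions: the class name `WaveletTree` is hypothetical, bitmaps are plain Python lists with linear-time rank and select (so the $O(\log \sigma)$ bound does not hold here), the alphabet is split so that the left half gets the extra symbol when its size is odd, and positions are 1-based as in the text.

```python
class WaveletTree:
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        self.bits = []
        if len(self.alphabet) > 1:
            mid = (len(self.alphabet) + 1) // 2
            lo, hi = self.alphabet[:mid], self.alphabet[mid:]
            lo_set = set(lo)
            self.bits = [0 if c in lo_set else 1 for c in seq]
            self.left = WaveletTree([c for c in seq if c in lo_set], lo)
            self.right = WaveletTree([c for c in seq if c not in lo_set], hi)

    def _rank_bit(self, bit, i):
        """Occurrences of bit in bits[1..i] (naive scan)."""
        return sum(1 for x in self.bits[:i] if x == bit)

    def _select_bit(self, bit, j):
        """1-based position of the j-th occurrence of bit (naive scan)."""
        seen = 0
        for pos, x in enumerate(self.bits, start=1):
            if x == bit:
                seen += 1
                if seen == j:
                    return pos
        raise ValueError("fewer than j occurrences")

    def access(self, i):
        node = self
        while node.left is not None:
            bit = node.bits[i - 1]
            i = node._rank_bit(bit, i)
            node = node.left if bit == 0 else node.right
        return node.alphabet[0]

    def rank(self, c, i):
        if c not in self.alphabet:
            return 0
        node = self
        while node.left is not None:
            bit = 0 if c in node.left.alphabet else 1
            i = node._rank_bit(bit, i)
            node = node.left if bit == 0 else node.right
        return i

    def select(self, c, j):
        if self.left is None:
            return j  # at the leaf, the j-th c is at position j
        if c in self.left.alphabet:
            return self._select_bit(0, self.left.select(c, j))
        return self._select_bit(1, self.right.select(c, j))
```

Building the tree on `alabar_a_la_alabarda` gives the root bitmap `01000100010001000110`, and the queries `access(11)`, `rank('l', 11)` and `select('b', 2)` return `'a'`, 2 and 16 respectively.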

Example 2.14. Figure 2.4 shows an example of how we retrieve the 11th symbol of
sequence S ($access(S, 11)$ = a). First we access the bitmap of the root and see that at
position 11 there is a 0. Hence we descend to the left. Then using $rank_0(B, 11) = 8$
we count how many zeroes are up to position 11. This value is our new position in
the next level. Then we continue the process until we reach a leaf; in that case the
symbol stored in that leaf is the symbol sought, in our case an 'a'.

Figure 2.4: Example of access in a wavelet tree

Example 2.15. Figure 2.5 shows step by step how we compute $rank_l(S, 11) = 2$.
Since symbol 'l' is mapped to a 1 we descend from the root to the right child. Using
$rank_1(B, 11) = 3$ we count the number of ones up to that position. This is our new
position in the next level. Then we continue the process until we reach a leaf. The
value sought is the last value of rank, in our case 2.

Example 2.16. Figure 2.6 shows an example of how to select the second 'b' in
the sequence S ($select_b(S, 2)$). First we descend to the leaf representing symbol 'b'.
Since that symbol was last mapped to a 1, we go to the parent and compute our new
position as $select_1(B, 2) = 12$. In that level, 'b' was mapped to a 0, so we go to the
parent and the new position is $select_0(B, 12) = 16$, and that is the value sought.

2.6.1 Range Search 

A direct application of wavelet trees is to answer range search queries [MN07]. This 
method is very similar to the idea of Chazelle [Cha88] . 

Definition 2.17. Given a subset \(R\) (\(|R| = t\)) of the discrete range \([1, n] \times [1, n]\), a
range query returns the points \(p \in R\) belonging to a range \([x_1, x_2] \times [y_1, y_2]\).

Figure 2.5: Example of rank in a wavelet tree

Figure 2.6: Example of select in a wavelet tree

Figure 2.7: Example of 2-dimensional range query

An extension of the wavelet tree supports range queries using \(n + t \log n + o(n + t \log n)\)
bits, counting the number of points within the range in time \(O(\log n)\) and
reporting each occurrence in time \(O(\log n)\). We will use a modified version of the
implementation of Gonzalo Navarro.

We explain here a simplified version for the case in which there exists exactly
one point for each value of x. We order the points of \(R\) by their x coordinate and
create the sequence \(S[1, n]\), such that for each \((x, y) \in R\), \(S[x] = y\). Then we build
the wavelet tree of \(S\).

Example 2.18. Figure 2.7 shows a grid, with exactly one y value for each value of
x. The figure shows in yellow the range [17, 19] × [9, 18], containing two occurrences;
in red and yellow, the range [17, 19] × [1, 21], containing 3 occurrences; and in green
and yellow, the range [1, 21] × [9, 18], containing 10 occurrences.

Footnote (implementation of Gonzalo Navarro): Check the LZ77-index source code
(http://pizzachili.dcc.uchile.cl/indexes/LZ77-index) for the updated version.
Projecting

A range in \(S\) represents a range along the x coordinate and the splits made by the
wavelet tree define ranges along the y coordinate. Every time we descend to a child of
a node we need to know where the range represented in that child is. The operation
of determining the range defined by a child, given the range of the parent, is called
projecting. Using rank we project a range downwards. Given a node with bitmap
\(B\) the left projection of \([x, x']\) is \([1 + rank_0(B, x - 1), rank_0(B, x')]\) and the right
projection is \([1 + rank_1(B, x - 1), rank_1(B, x')]\). A range \([y, y']\) along the y coordinate
is projected to the left as \([y, \lfloor (y + y')/2 \rfloor]\) and to the right as \([\lfloor (y + y')/2 \rfloor + 1, y']\).

Counting 

We start from the root with the one-dimensional ranges \([x, x'] = [x_1, x_2]\) and \([y, y'] = [1, n]\)
and project them in both subtrees. We do this recursively until:

1. \([x, x'] = \emptyset\);

2. \([y, y'] \cap [y_1, y_2] = \emptyset\); or

3. \([y, y'] \subseteq [y_1, y_2]\), in which case we add \(x' - x + 1\) to the total.

As the interval \([y_1, y_2]\) is covered by \(O(\log n)\) maximal wavelet tree nodes, the total
time to count the occurrences is \(O(\log n)\).
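
The projection and the three stopping rules can be illustrated with the following Python sketch. It is not the implementation used in this thesis: the function name `count_range` is hypothetical, the bitmap of each node is recomputed on the fly instead of being stored in a wavelet tree, rank is a linear scan, and the y range starts as [1, max(S)] rather than [1, n].

```python
def count_range(S, x1, x2, y1, y2):
    """Count positions x in [x1, x2] (1-based) such that y1 <= S[x] <= y2."""

    def rec(seq, x, xp, y, yp):
        # Rules 1 and 2: empty x range, or y range disjoint from [y1, y2].
        if x > xp or yp < y1 or y > y2:
            return 0
        # Rule 3: y range fully contained, every projected position counts.
        if y1 <= y and yp <= y2:
            return xp - x + 1
        mid = (y + yp) // 2
        bits = [0 if v <= mid else 1 for v in seq]   # bitmap of this node

        def rank0(i):
            return bits[:i].count(0)

        def rank1(i):
            return i - rank0(i)

        left = [v for v in seq if v <= mid]
        right = [v for v in seq if v > mid]
        # Project [x, x'] to both children and split [y, y'] at the midpoint.
        return (rec(left, 1 + rank0(x - 1), rank0(xp), y, mid)
                + rec(right, 1 + rank1(x - 1), rank1(xp), mid + 1, yp))

    return rec(list(S), x1, x2, 1, max(S))


S = [3, 1, 4, 1, 5, 9, 2, 6]
print(count_range(S, 2, 6, 1, 4))  # positions 2..6 with values in [1, 4]
```

A real implementation would store the bitmaps once and answer rank in constant time, which is what yields the logarithmic bound stated above.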

Example 2.19. Figure 2.8 shows the wavelet tree that represents the range of Figure 
2.7. The figure represents how to count the occurrences in the range [17, 19] x [9, 18]. 
The figure shows in red how the range [17, 19] in the x coordinate is projected downwards.
The nodes below the blue line are those whose y range is contained in the
range [9, 18]. Additionally, the nodes with a circle next to them are those in which 
the counting process ends. The blue circle represents rule number 1 (see above), the 
green one represents rule number 2 and finally rule number 3 is represented by the red 
circle. In our case, at each node marked with red, we report one occurrence, yielding 
a total of 2 occurrences. 

Locating 

To locate the actual points we start from each node in which we were counting. If
we want to know the x coordinate we go up using select and if we want to know the
y coordinate we go down using rank. This operation takes \(O(\log n)\) time for each point
located.

Figure 2.8: Example of counting the occurrences in a 2-dimensional range query using
a wavelet tree



2.7 Permutations 

A permutation is a bijection \(\pi : [1, n] \to [1, n]\), and we are interested in computing
efficiently both \(\pi(i)\) and \(\pi^{-1}(i)\) for any \(1 \leq i \leq n\). The permutation can be represented
in a plain array using \(n \log n\) bits, by storing \(P = [\pi(1), \ldots, \pi(n)]\). This answers \(\pi(i)\)
in constant time. Solving \(\pi^{-1}(i)\) can be done by sequentially scanning \(P\) for the
position \(j\) where \(\pi(j) = i\). A more efficient solution [MRRR03] is based on the cycles
of a permutation. A cycle is a sequence \(i, \pi(i), \pi^2(i), \ldots, \pi^k(i)\) such that \(\pi^{k+1}(i) = i\).
Every \(i\) belongs to exactly one cycle. Then, to compute \(\pi^{-1}(i)\) we repeatedly apply \(\pi\)
over \(i\), finding the element \(e\) of the cycle such that \(\pi(e) = i\). These solutions do not
require any extra space to compute but they take \(O(n)\) time in the worst case.

Representing the sequence \(\pi[1, n]\) with a wavelet tree one can answer both queries
using \(O(\log n)\) time and \(n \log n + o(n \log n)\) bits of space. A faster solution [MRRR03]
is based on the cycles of the permutation. By introducing shortcuts in the cycles, it
uses \((1 + \epsilon) n \log n + O(n)\) bits and solves \(\pi(i)\) in constant time and \(\pi^{-1}(i)\) in \(O(1/\epsilon)\)
time, for any \(\epsilon > 0\).

We will use the implementation of Munro et al.'s shortcut technique by Diego
Arroyuelo, available at http://code.google.com/p/libcds.
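
To make the cycle-based idea concrete, the following Python sketch computes the inverse of a permutation by walking cycles, with backward shortcuts placed every t elements. It is written in the spirit of the shortcut technique and is not the structure of [MRRR03] nor the libcds code: the names `build_shortcuts` and `inverse` are hypothetical, the shortcuts are kept in a dictionary rather than in a compact structure, and the array is 1-based with `P[0]` unused.

```python
def build_shortcuts(P, t):
    """For every cycle longer than t, mark every t-th element with a pointer
    to the element t steps behind it on the cycle."""
    n = len(P) - 1
    back = {}
    seen = [False] * (n + 1)
    for s in range(1, n + 1):
        if seen[s]:
            continue
        cycle = []
        j = s
        while not seen[j]:
            seen[j] = True
            cycle.append(j)
            j = P[j]
        if len(cycle) > t:
            for k in range(0, len(cycle), t):
                # A negative index wraps around the cycle, as intended.
                back[cycle[k]] = cycle[k - t]
    return back


def inverse(P, back, i):
    """Return j with P[j] == i, using at most one shortcut."""
    j, jumped = i, False
    while P[j] != i:
        if not jumped and j in back:
            j, jumped = back[j], True   # jump t steps backwards
        else:
            j = P[j]                    # walk one step forwards
    return j


P = [0, 3, 5, 1, 2, 4, 7, 6]            # pi(1)=3, pi(2)=5, ...
back = build_shortcuts(P, 2)
print([inverse(P, back, i) for i in range(1, 8)])
```

With shortcuts every t elements the walk visits O(t) elements, which mirrors the trade-off between extra space and inverse time described above.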

2.8 Tree Representations 

A classical representation of a general tree of \(n\) nodes requires \(O(nw)\) bits of space,
where \(w \geq \log n\) is the bit length of a machine pointer. Typically only operations such
as moving to the first child and to the next sibling, or to the i-th child, are supported
in constant time. By further increasing the constant, some other simple operations are
easily supported, such as moving to the parent, knowing the subtree size, or the depth
of the node. However, the \(\Omega(n \log n)\) bit space complexity is excessive in terms of
information theory. The number of different general trees of \(n\) nodes is \(C_n \approx 4^n / n^{3/2}\),
hence \(\log C_n = 2n - \Theta(\log n)\) bits are sufficient to distinguish any one of them.

There are several succinct tree representations that use 2n + o(n) bits of space and 
answer most queries in constant time (see the review by Arroyuelo et al. [ACNS10]
for a detailed exposition); here we explain the DFUDS [BDM+05] representation as 
this is the one that meets our requirements. 

Definition 2.20. A sequence \(S\) drawn from alphabet \(\Sigma = \{0, 1\}\) is said to be balanced
if: (1) there are as many 0s as 1s and (2) at any position \(i\) the number of zeroes to
the left is greater than or equal to the number of ones (i.e., \(rank_0(S, i) \geq rank_1(S, i)\)).
Usually a balanced sequence is referred to as balanced parentheses by identifying 0 as '('
and 1 as ')', as the nesting of parentheses satisfies the above definition.

The operations defined over a balanced sequence are: (1) findclose(S, i)
(findopen(S, i)) finds the matching 1 (0) of the 0 (1) at position i, and (2) enclose(S, i) is
the position of the tightest node enclosing node i.

Definition 2.21 ([BDM+05]). The Depth-first unary degree sequence (DFUDS) is 
generated by a depth-first traversal of the tree, at each node appending the degree of 
the node in unary. Additionally a leading 1 is prepended to the sequence to make it 
balanced and allow the concatenation of several such encodings into a forest. 

The DFUDS sequence represents the topology of the tree using 2n bits. Tree nodes
are identified in the DFUDS sequence according to their rank in the order given by the
depth-first traversal (more precisely, the i-th node is identified by position \(select_1(i)\)
in the DFUDS encoding). Figure 2.9 shows the DFUDS bit-sequence for the example
tree. The red 1 in the sequence is the leading 1 added to make the sequence
balanced. The green node is represented by the 10th 1 in the sequence, as it is the
10th node visited during a depth-first traversal. The blue sequence of five 1s and one
0 is the degree of the blue node.

Footnote (on Diego Arroyuelo): Yahoo! Research, Chile. darroyue@dcc.uchile.cl




11110011111000011000011000 



Figure 2.9: Example of DFUDS representation 
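
As a small illustration of Definition 2.21, the following Python sketch produces the DFUDS bit string of a tree given as a dictionary of child lists. The function name `dfuds` and the dictionary layout are hypothetical, and the tree below is an assumption: its shape was obtained by decoding the bit sequence printed above, with nodes numbered in preorder, since the drawing of the example tree is not available here.

```python
def dfuds(children, root):
    """DFUDS bit string: a leading 1, then for each node in depth-first
    order its degree in unary (one 1 per child, closed by a 0)."""
    bits = ["1"]
    stack = [root]
    while stack:
        v = stack.pop()
        kids = children.get(v, [])
        bits.append("1" * len(kids) + "0")
        stack.extend(reversed(kids))   # visit children from left to right
    return "".join(bits)


# Tree shape inferred from the sequence of Figure 2.9 (preorder numbering).
tree = {1: [2, 3, 11], 3: [4, 5, 6, 7, 10], 7: [8, 9], 11: [12, 13]}
print(dfuds(tree, 1))
```

A tree of n nodes yields exactly 2n bits: n - 1 ones for the edges, one leading 1, and n zeroes closing the unary degrees.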

To solve the common operations over trees two data structures are built over
the DFUDS sequence: a bitmap data structure supporting rank and select (Section
2.5) and a data structure solving operations findclose, findopen and enclose [Jac89,
MR01, Nav09]. These structures allow one to compute the most common operations
in constant time using \(o(n)\) additional bits of space. Additionally, if we use labeled
trees we need to store the labels of the edges in an array chars, using \(n \log \sigma\) additional
bits, where \(\sigma\) is the labels' alphabet size. The label of the edge pointing to the i-th
child of node x is at \(chars[rank_1(dfuds, x) + i]\). The operations we are interested in
for this thesis are:

• degree(x): number of children of node x. 

• isLeaf(x): whether node x is a leaf. 

• child(x,i): i-th child of node x. 

• labeledChild(x, c): child of node x labeled by symbol c.

• leftmostLeaf(x): leftmost leaf of the subtree starting at node x. 

• rightmostLeaf(x): rightmost leaf of the subtree starting at node x. 

• leafRank(x): number of leaves to the left of node x. 

• preorder(x): preorder position of node x.

All these operations can be solved theoretically in constant time; however, in practice
labeledChild is solved by binary searching the labels of the children, because it is much
easier to implement and fast enough in practice. To solve leftmostLeaf, rightmostLeaf
and leafRank we need to solve the queries \(rank_{00}(i)\) and \(select_{00}(i)\). \(rank_{00}(i)\) returns
the number of occurrences of the substring 00 in the bitmap up to position \(i\) and
\(select_{00}(i)\) returns the position \(p\) of the i-th occurrence of the substring 00 in the
bitmap. Solving these queries requires an additional data structure that uses \(o(n)\)
bits. It uses the same ideas as the one for solving rank and select for binary alphabets.

We will use a modified version of the implementation of Diego Arroyuelo available
at http://code.google.com/p/libcds, adding support for leaf-related operations.

2.9 Tries 

A trie or digital tree is a data structure that stores a set of strings. It can find the 
elements of the set prefixed by a pattern in time proportional to the pattern length. 

Definition 2.22. A trie for a set \(S\) of distinct strings is a tree where each node
represents a distinct prefix in the set. The root node represents the empty prefix \(\varepsilon\).
A node \(v\) representing prefix \(Y\) is a child of node \(u\) representing prefix \(X\) if \(Y = Xc\)
for some character \(c\), which labels the edge between \(u\) and \(v\).

We suppose that all strings are ended by a special symbol $, not present in the
alphabet. We do this in order to ensure that no string \(S_i\) is a prefix of some other
string \(S_j\). This property guarantees that the tree has exactly \(|S|\) leaves. Figure 2.10
shows an example of a trie.

A trie for the set \(S = \{S_1, \ldots, S_n\}\) is easily built in \(O(|S_1| + \ldots + |S_n|)\) time
by successive insertions (assuming we can descend to any child in constant time). A
pattern \(P\) is searched for in the trie starting from the root and following the edges
labeled with the characters of \(P\). This takes a total time of \(O(|P|)\).
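
The construction by successive insertions and the prefix search can be sketched as follows. This is a minimal illustration with nested dictionaries; the class name `Trie` and its methods are hypothetical, and the terminator $ is appended internally as assumed in the text.

```python
class Trie:
    """Trie over strings terminated by '$'; each node is a dict of children."""

    def __init__(self):
        self.root = {}

    def insert(self, s):
        node = self.root
        for c in s + "$":                 # one edge per character
            node = node.setdefault(c, {})

    def with_prefix(self, p):
        """Return the stored strings prefixed by p, in sorted order."""
        node = self.root
        for c in p:                       # O(|P|) descent from the root
            if c not in node:
                return []
            node = node[c]
        found, stack = [], [(node, p)]
        while stack:                      # collect the leaves of the subtree
            nd, prefix = stack.pop()
            for c, child in nd.items():
                if c == "$":
                    found.append(prefix)
                else:
                    stack.append((child, prefix + c))
        return sorted(found)


trie = Trie()
for s in ["alabar", "a", "la", "alabarda"]:
    trie.insert(s)
print(trie.with_prefix("ala"))
```

The descent costs time proportional to the pattern length; collecting the answers adds time proportional to the size of the subtree reached.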

A compact trie is an alternative representation that reduces the space of the 
trie by collapsing unary nodes into a single node and labeling the edge with the 
concatenation of all labels. A PATRICIA tree [Mor68], an alternative that uses even 
less space, just stores the first character of the label string and its length. This variant 
is used when the strings S are available separately, as not all information is stored 
in the edges. In this variant, after the search we need to check if the prefix found
actually matches the pattern. To do so, we have to extract the text corresponding
to any string with the prefix found and compare it with the pattern. If they are equal
then all leaves will be occurrences (i.e., strings prefixed with the pattern), otherwise
none will be an occurrence. Figure 2.11 shows an example of this kind of trie.

Figure 2.10: Example of a trie for the set S = {'alabar', 'a', 'la', 'alabarda'}.



Figure 2.11: Example of a PATRICIA trie for the set S = {'alabar', 'a', 'la', 'alabarda'}.
The values in parentheses are respectively the first character of the label and the
length of the label.

Definition 2.23. A suffix trie is a trie composed of all the suffixes \(T[i, n]\) of a given
text \(T[1, n]\). The leaves of the trie store the positions where the suffixes start.



2.10 Suffix Trees 

Definition 2.24 ([Wei73, McC76]). A suffix tree is a PATRICIA tree built over all
the suffixes \(T[i, n]\) of a given text \(T[1, n]\). The leaves in the tree indicate the text
positions where the corresponding suffixes start.

Figure 2.12 shows the suffix tree for the text 'alabar_a_la_alabarda$'. 




Figure 2.12: The suffix tree for the text 'alabar_a_la_alabarda$'



The suffix tree can be built in \(O(n)\) time using \(O(n \log n)\) bits of space [McC76,
Ukk95].

A suffix tree is able to find all the occ occurrences of a pattern P of length m 
in time \(O(m + occ)\), i.e., to solve the locate query described in Section 2.2. After
descending by the tree according to the characters of the pattern, we could be in 
three different cases: i) we reach a point in which there is no edge labeled with the 
current character of P, which means that the pattern does not occur in T; ii) we 
finish reading P in an internal node (or in the middle of an edge), in which case the 
suffixes of the corresponding subtree are either all occurrences or none, therefore we 
only need to check if one of those suffixes matches the pattern P; iii) we end up in a 
leaf without consuming all the pattern, in which case at most one occurrence is found 
after checking the suffix with the pattern. As a subtree with occ leaves has \(O(occ)\)
nodes, the total time for reporting the occurrences is as stated above.

The suffix tree can solve the queries count and exists in \(O(m)\) time. The process
is similar to that of locate. First we descend the tree according to the pattern. Then, 
we check if one of the suffixes of the subtree is a match. If it is a match the answer 
of count is the number of leaves of the subtree (for which we need to store in each 
internal node the number of leaves that descend from it), otherwise it is zero. 

2.11 Suffix Arrays 

Definition 2.25 ([MM93, GBYS92]). A suffix array \(A[1, n]\) is a permutation of the
integer interval \([1, n]\) holding \(T[A[i], n] < T[A[i + 1], n]\) for all \(1 \leq i < n\). In other
words, it is a permutation of the suffixes of the text such that the suffixes are
lexicographically sorted.

Figure 2.13 shows the suffix array for the text 'alabar_a_la_alabarda$'. The
character $ is the smallest one in lexicographical order. The zone highlighted in gray
represents those suffixes starting with 'a'.



Figure 2.13: The suffix array for the text 'alabar_a_la_alabarda$'

Note that the suffix array could be computed by collecting the values at the leaves 
of the suffix tree. However, several methods exist that compute the suffix array in 
\(O(n)\) or \(O(n \log n)\) time, using significantly less space. For a complete survey see
[PST07]. 

The suffix array can solve locate queries in \(O(m \log n + occ)\) time, and count and
exists queries in \(O(m \log n)\) time. First, we search for the interval \(A[sp_1, ep_1]\) of the
suffixes starting with \(P[1]\). This can be done via two binary searches on \(A\). The first
binary search determines the starting position \(sp\) for the suffixes lexicographically
larger than or equal to \(P[1]\), and the second determines the ending position \(ep\) for
suffixes that start with \(P[1]\). Then, we consider \(P[2]\), narrowing the interval to
\(A[sp_2, ep_2]\), holding all suffixes starting with \(P[1, 2]\). This process continues until \(P\)
is fully consumed or the current interval becomes empty. Note that this algorithm
searches for the pattern from left to right. For each character of the pattern, we do two
binary searches taking at most \(O(\log n)\) time, hence the total time is \(O(m \log n)\). Then
locate reports all occurrences in \(O(occ)\) time and the answer to count is \(ep_m - sp_m + 1\).
We can also directly search for the interval \(A[sp, ep]\) where the suffixes start with the
pattern \(P\) using just two binary searches on \(A\), which find the first and last position
where the suffixes start with \(P\). Each comparison between the pattern and a suffix
will take at most \(O(m)\) time, hence the total running time is also \(O(m \log n)\). Yet,
this is faster in practice than the previous method and is what we use in this thesis.
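
The direct search with two binary searches can be sketched as follows. This is an illustrative Python version, not the code used in this thesis: the names `suffix_array` and `sa_range` are hypothetical, the suffix array is built by plainly sorting the suffixes (far from the linear-time constructions cited above), and the order of the symbols is the ASCII order, in which $ precedes '_' and the letters.

```python
def suffix_array(T):
    """1-based starting positions of the suffixes of T in lexicographic order."""
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])


def sa_range(T, A, P):
    """Return the 1-based inclusive interval (sp, ep) of suffixes that start
    with P, or None if P does not occur. Two binary searches of O(m) steps."""
    m = len(P)

    def prefix(k):                       # first m characters of the k-th suffix
        start = A[k] - 1
        return T[start:start + m]

    lo, hi = 0, len(A)
    while lo < hi:                       # first suffix whose prefix is >= P
        mid = (lo + hi) // 2
        if prefix(mid) < P:
            lo = mid + 1
        else:
            hi = mid
    sp = lo
    hi = len(A)
    while lo < hi:                       # first suffix whose prefix is > P
        mid = (lo + hi) // 2
        if prefix(mid) <= P:
            lo = mid + 1
        else:
            hi = mid
    return (sp + 1, lo) if sp < lo else None


T = "alabar_a_la_alabarda$"
A = suffix_array(T)
sp, ep = sa_range(T, A, "ala")
print((sp, ep), sorted(A[sp - 1:ep]))    # interval and text positions
```

Here count is ep - sp + 1 and locate simply reads the entries A[sp..ep].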

2.12 Backward Search 

Backward search is an alternative method for finding the interval \([sp, ep]\) corresponding
to a pattern \(P\) in the suffix array. It searches for the pattern from right to left,
and is based on the Burrows-Wheeler transform.

Definition 2.26 ([BW94]). Given a text \(T\) terminated with the special character
\(T[n] = \$\) smaller than all others, and its suffix array \(A[1, n]\), the Burrows-Wheeler
transform (BWT) of \(T\) is defined as \(T^{bwt}[i] = T[A[i] - 1]\), except when \(A[i] = 1\),
where \(T^{bwt}[i] = T[n]\). In other words, the transformation is conceptually built first
by generating all the cyclic shifts of the text, then sorting them lexicographically, and
finally taking the last character of each shift. In practice it can be built in linear time
by building the suffix array first.

We can think of the sorted list of cyclic shifts as a conceptual matrix \(M[1, n][1, n]\).
Figure 2.14 shows an example of how the BWT is computed for the text
'alabar_a_la_alabarda$'. This transformation has the advantage of being easily
compressed by local compressors [Man01]. It can be reversed as follows.

Definition 2.27. The LF-mapping \(LF(i)\) maps a position \(i\) in the last column of \(M\)
(\(L = T^{bwt}\)) to its occurrence in the first column of \(M\) (\(F\)).

Lemma 2.28 ([FM05]). It holds

\[ LF(i) = C[c] + rank_c(T^{bwt}, i), \]

where \(c = T^{bwt}[i]\) and \(C[c]\) is the number of symbols smaller than \(c\) in \(T\).

Lemma 2.29 ([BW94]). The LF-mapping allows one to reverse the Burrows-Wheeler
transform.

alabar_a_la_alabarda$
labar_a_la_alabarda$a
abar_a_la_alabarda$al
bar_a_la_alabarda$ala
ar_a_la_alabarda$alab
r_a_la_alabarda$alaba
_a_la_alabarda$alabar
a_la_alabarda$alabar_
_la_alabarda$alabar_a
la_alabarda$alabar_a_
a_alabarda$alabar_a_l
_alabarda$alabar_a_la
alabarda$alabar_a_la_
labarda$alabar_a_la_a
abarda$alabar_a_la_al
barda$alabar_a_la_ala
arda$alabar_a_la_alab
rda$alabar_a_la_alaba
da$alabar_a_la_alabar
a$alabar_a_la_alabard
$alabar_a_la_alabarda

sort

$alabar_a_la_alabarda
_a_la_alabarda$alabar
_alabarda$alabar_a_la
_la_alabarda$alabar_a
a$alabar_a_la_alabard
a_alabarda$alabar_a_l
a_la_alabarda$alabar_
abar_a_la_alabarda$al
abarda$alabar_a_la_al
alabar_a_la_alabarda$
alabarda$alabar_a_la_
ar_a_la_alabarda$alab
arda$alabar_a_la_alab
bar_a_la_alabarda$ala
barda$alabar_a_la_ala
da$alabar_a_la_alabar
la_alabarda$alabar_a_
labar_a_la_alabarda$a
labarda$alabar_a_la_a
r_a_la_alabarda$alaba
rda$alabar_a_la_alaba

= M

Figure 2.14: The BWT of the text 'alabar_a_la_alabarda$'

Proof. We know that \(T[n] = \$\) and since $ is the smallest symbol, \(T[n] = F[1] = \$\)
and thus \(T[n - 1] = L[1] = T^{bwt}[1]\). Using the LF-mapping we compute \(i = LF(1)\);
knowing that \(T[n - 1]\) is at \(F[i]\), we have \(T[n - 2] = L[i]\), as \(L[i]\) always precedes \(F[i]\)
in \(T\). In general, it holds \(T[n - k] = T^{bwt}[LF^{k-1}(1)]\). □
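
The proof translates almost directly into code. The following Python sketch builds the BWT from a naively sorted suffix array and reverses it with the LF-mapping of Lemma 2.28. The function names `bwt` and `inverse_bwt` are hypothetical, the text is assumed to end with a unique $ that is smaller than every other symbol (true for ASCII letters and '_'), and no attempt is made at linear-time construction.

```python
def bwt(T):
    """BWT of T, where T ends with a unique, smallest '$'."""
    A = sorted(range(len(T)), key=lambda i: T[i:])   # 0-based suffix array
    # T[i - 1] with i == 0 reads T[-1], the final '$', as in Definition 2.26.
    return "".join(T[i - 1] for i in A)


def inverse_bwt(L):
    """Recover T from L = T^bwt using T[n - k] = L[LF^(k-1)(1)]."""
    counts = {}
    for c in L:
        counts[c] = counts.get(c, 0) + 1
    C, total = {}, 0
    for c in sorted(counts):             # C[c]: symbols smaller than c
        C[c] = total
        total += counts[c]
    LF, seen = [], {}
    for c in L:                          # LF(i) = C[c] + rank_c(L, i), 1-based
        seen[c] = seen.get(c, 0) + 1
        LF.append(C[c] + seen[c])
    n = len(L)
    out = ["$"] * n                      # T[n] = $
    i = 1
    for k in range(1, n):
        out[n - 1 - k] = L[i - 1]        # 0-based slot of T[n - k]
        i = LF[i - 1]
    return "".join(out)


T = "alabar_a_la_alabarda$"
L = bwt(T)
print(L)
print(inverse_bwt(L) == T)
```

The string printed first is the last column of the matrix M of Figure 2.14.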

Given the close relation between the suffix array and the BWT, it is natural to
expect that a search algorithm can work on top of the BWT. Such an algorithm is
called backward search (BWS), and at each stage it narrows the interval \([sp_i, ep_i]\) of
the suffix array in which the suffixes start with \(P[i, m]\), starting from \(i = m\) and
ending with \(i = 1\). Narrowing the interval \(A[sp, ep]\) with a new character \(c\) is called a
BWS(sp, ep, c) step and it is done very similarly to the LF-mapping (Lemma 2.28).
BWS searches a pattern from right to left, opposite to the search on suffix arrays,
which searches for a pattern from left to right.

Figure 2.15 shows the backward search algorithm. Lines 5-7 correspond to the 
BWS step. 

BWS(P)
 1  i ← len(P)
 2  sp ← 1
 3  ep ← n
 4  while sp ≤ ep and i ≥ 1 do
 5      c ← P[i]
 6      sp ← C[c] + rank_c(T^bwt, sp − 1) + 1
 7      ep ← C[c] + rank_c(T^bwt, ep)
 8      i ← i − 1
 9  if sp > ep then
10      return ∅
11  return (sp, ep)

Figure 2.15: Backward Search algorithm (BWS) 
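
A direct transcription of the BWS pseudocode into Python could look as follows. It is only a sketch: the function name `backward_search` is hypothetical, the array C and the rank queries are computed naively by scanning the BWT (a real index would answer rank on a compressed sequence), and the BWT string in the example is the last column of the matrix M of Figure 2.14.

```python
def backward_search(L, P):
    """Return the 1-based interval (sp, ep) of P in the suffix array, given
    L = T^bwt, or None if P does not occur."""
    n = len(L)
    # C[c]: number of symbols in the text that are smaller than c.
    C = {c: sum(1 for x in L if x < c) for c in set(L)}

    def rank(c, i):                      # occurrences of c in L[1..i]
        return L[:i].count(c)

    sp, ep = 1, n
    i = len(P)
    while sp <= ep and i >= 1:
        c = P[i - 1]
        if c not in C:                   # symbol absent from the text
            return None
        sp = C[c] + rank(c, sp - 1) + 1  # BWS step, lines 5-7 of the figure
        ep = C[c] + rank(c, ep)
        i -= 1
    return None if sp > ep else (sp, ep)


L = "araadl_ll$_bbaar_aaaa"              # BWT of 'alabar_a_la_alabarda$'
print(backward_search(L, "ala"))
```

The pattern is consumed from its last character to its first, and the interval after each step covers the suffixes starting with the part of the pattern read so far.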

2.13 Lempel-Ziv Parsings and Repetitions 

Lempel and Ziv proposed in the seventies a new compression method [LZ76, ZL77, 
ZL78]. The basic idea is to replace a repeated portion of the text with a pointer 
to some previous occurrence of that portion. To find the repetitions they keep a 
dictionary representing all the portions that can be copied. Many variants of these 
algorithms exist [SS82, Wel84, Wil91] which differ in the way they parse the text or 
the encoding they use. 

The LZ77 [ZL77] parsing is a dictionary-based compression scheme in which the 
dictionary used is the set of substrings of the preceding text. This definition allows 
it to get one of the best compression ratios for repetitive texts. 

Definition 2.30 ([ZL77]). The LZ77 parsing of text \(T[1, n]\) is a sequence \(Z[1, n']\) of
phrases such that \(T = Z[1]Z[2] \ldots Z[n']\), built as follows. Assume we have already
processed \(T[1, i - 1]\) producing the sequence \(Z[1, p - 1]\). Then, we find the longest
prefix \(T[i, i' - 1]\) of \(T[i, n]\) which occurs in \(T[1, i - 1]\), set \(Z[p] = T[i, i']\) and continue
with \(i = i' + 1\). The occurrence in \(T[1, i - 1]\) of the prefix \(T[i, i' - 1]\) is called the
source of the phrase \(Z[p]\).

Note that each phrase is composed of the content of a source, which can be the
empty string \(\varepsilon\), plus a trailing character. Note also that all phrases of the parsing
are different, except possibly the last one. To avoid that case, a special character $
is appended at the end, \(T[n] = \$\).

Footnote (to Definition 2.30): The original definition allows the source of \(T[i, i' - 1]\) to
extend beyond position \(i - 1\), but we ignore this feature in this thesis.

Typically a phrase is represented as a triple \(Z[p] = (start, len, c)\), where start is
the start position of the source, len is the length of the source and c is the trailing
character.

Example 2.31. Let T = 'alabar_a_la_alabarda$'; the LZ77 parsing is as follows:

a | l | ab | ar | _ | a_ | la_ | alabard | a$

In this example the seventh phrase copies two characters starting at position 2 and
has a trailing character '_'.

One of the greatest advantages of this algorithm is the simple and fast scheme 
of decompression, opposed to the construction algorithm which is more complicated. 
Decompression runs in linear time by copying the source content referenced by each 
phrase and then the trailing character. However, random text extraction is not as 
easy. 
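
The parsing of Definition 2.30 and the linear-time decompression can be sketched in Python as follows. This is a deliberately naive illustration: the names `lz77_parse` and `lz77_decode` are hypothetical, the longest previous occurrence is found with repeated substring searches (quadratic or worse, unlike real construction algorithms), sources are restricted to T[1, i-1] as assumed in this thesis, and an empty source is encoded with start 0.

```python
def lz77_parse(T):
    """Return the phrases of T as triples (start, length, char), start 1-based."""
    phrases, i, n = [], 0, len(T)
    while i < n:
        length, start = 0, 0
        # Extend the source while it still occurs entirely inside T[1, i-1]
        # and at least one character remains to act as the trailing one.
        while i + length + 1 < n:
            pos = T.find(T[i:i + length + 1], 0, i)
            if pos == -1:
                break
            length, start = length + 1, pos + 1
        phrases.append((start, length, T[i + length]))
        i += length + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the text by copying each source and appending its character."""
    out = []
    for start, length, c in phrases:
        out.extend(out[start - 1:start - 1 + length])
        out.append(c)
    return "".join(out)


T = "alabar_a_la_alabarda$"
Z = lz77_parse(T)
print(Z[6])                 # seventh phrase
print(lz77_decode(Z) == T)
```

Decoding touches each output character once, which is the linear-time behaviour mentioned above; extracting an arbitrary substring without decoding the prefix is the harder problem.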

The LZ78 [ZL78] compression scheme is also dictionary-based. Its dictionary is 
the set of all phrases previously produced. Because of this definition of the dictionary 
the construction process is much simpler than that of LZ77. 

Definition 2.32 ([ZL78]). The LZ78 parsing of text \(T[1, n]\) is a sequence \(Z[1, n']\) of
phrases such that \(T = Z[1]Z[2] \ldots Z[n']\), built as follows. Assume we have already
processed \(T[1, i - 1]\) producing the sequence \(Z[1, p - 1]\). Then, we find the longest
phrase \(Z[j]\), for \(j \leq p - 1\), that is a prefix of \(T[i, n]\), set \(Z[p] = Z[j]\,T[i + |Z[j]|]\) and
continue with \(i = i + |Z[j]| + 1\).

Typically a phrase is represented as \(Z[p] = (j, c)\), where j is the phrase number
of the source and c is the trailing character.

Example 2.33. Let T = 'alabar_a_la_alabarda$'; the LZ78 parsing is as follows:

a | l | ab | ar | _ | a_ | la | _a | lab | ard | a$

In this example the ninth phrase copies two characters starting at position 2 and has
a trailing character 'b'.
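
For comparison with LZ77, the LZ78 parsing of Definition 2.32 can be sketched in Python as follows. The names `lz78_parse` and `lz78_decode` are hypothetical, the dictionary of phrases is a plain Python dictionary instead of a trie, and phrase number 0 stands for the empty phrase.

```python
def lz78_parse(T):
    """Return the phrases of T as pairs (j, c): phrase number of the source
    (0 for the empty phrase) and trailing character."""
    dictionary = {"": 0}
    phrases, i, n = [], 0, len(T)
    while i < n:
        length = 0
        # The dictionary is prefix-closed, so the longest matching phrase can
        # be found by extending one character at a time.
        while i + length + 1 < n and T[i:i + length + 1] in dictionary:
            length += 1
        phrases.append((dictionary[T[i:i + length]], T[i + length]))
        dictionary[T[i:i + length + 1]] = len(phrases)
        i += length + 1
    return phrases


def lz78_decode(phrases):
    """Return the list of phrase contents; their concatenation is the text."""
    texts = [""]
    for j, c in phrases:
        texts.append(texts[j] + c)
    return texts[1:]


T = "alabar_a_la_alabarda$"
Z = lz78_parse(T)
print(Z[8])                  # ninth phrase: source phrase number and character
print(lz78_decode(Z))
```

Each phrase extends a whole earlier phrase by one character, which is why the construction is simpler than for LZ77, and also why long repetitions are captured more slowly.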

With respect to compression, both LZ77 and LZ78 converge to the entropy of
stationary ergodic sources [LZ76, ZL78]. They also converge below the empirical entropy
(Section 2.3), as detailed next.

Definition 2.34 ([KM99]). A parsing algorithm is said to be coarsely optimal if
its compression ratio \(\rho(T)\) differs from the k-th order empirical entropy \(H_k(T)\) by
a quantity depending only on the length of the text and that goes to zero as the
length increases. That is, \(\forall k \; \exists f_k\), \(\lim_{n \to \infty} f_k(n) = 0\), such that for every text \(T\),
\(\rho(T) \leq H_k(T) + f_k(|T|)\).

Theorem 2.35 ([KM99, PWZ92]). The LZ77 and LZ78 parsings are coarsely optimal.

As explained in Section 2.3, however, converging to \(H_k(T)\) is not sufficiently good
for repetitive texts. Repetitive texts originate in applications where many similar
versions of one base text are generated (e.g., DNA sequences); or where successive
versions, each one similar to the preceding one (e.g., a wiki), are generated. Statistical
compressors are not able to capture this characteristic, because they predict a symbol
based only on a short previous context, and such statistics do not change when the
text is replicated many times (see Section 2.3 for the relation between \(H_k(T)\) and
\(H_k(TT)\)). Compressors based on repetitions, such as Lempel-Ziv parsings or
grammar-based ones, do exploit this repetitiveness.

2.14 Self-Indexes 

Definition 2.36. A self-index [NM07] is an index that uses space proportional to 
that of the compressed text and solves the queries locate and extract. As this kind of 
indexes can reproduce any text substring, they replace the original text. Additionally, 
some indexes provide more efficient ways of computing exists and count queries. 

There are several general-purpose self-indexes; however, most of them do not
achieve high compression for repetitive texts, as they are only able to compress up
to the k-th order empirical entropy (Section 2.3). Most are based on the BWT or
suffix array (see [NM07] for a complete survey). In recent years some self-indexes
oriented to repetitive texts have been proposed. We cover these now.

2.14.1 Run-Length Compressed Suffix Arrays (RLCSAs) 

The Run-Length Compressed Suffix Array (RLCSA) [SVMN08] is based on the
Compressed Suffix Array of Sadakane [Sad03]. This is built around the so-called \(\Psi\)
function.

Definition 2.37 ([GV05]). Let \(A[1, \ldots, n]\) be the suffix array of a text \(T\). Then \(\Psi(i)\)
is defined as

\[ \Psi(i) = A^{-1}[(A[i] \bmod n) + 1] \]

The \(\Psi\) function is the inverse of the LF mapping. It maps suffix \(T[A[i], n]\) to suffix
\(T[A[i] + 1, n]\), allowing one to scan the text from left to right. A run in the \(\Psi\) array
is an interval \([a, b]\) for which it holds \(\forall i \in [a, b - 1]\), \(\Psi(i + 1) = \Psi(i) + 1\).

In the RLCSA, one run-length encodes the differences \(\Psi[i] - \Psi[i - 1]\) and stores
absolute samples of the array \(\Psi\). This structure is very fast for count and exists
queries. Its major drawback is the sampling it requires for locate and extract queries,
as it takes \((n \log n)/s\) extra bits to achieve locating time \(O(s)\), and time \(O(s + r - l)\)
for extract(l, r), where \(s\) is the sampling step.
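
The definition of Ψ and of its runs can be checked on a small text with the following Python sketch. It is not related to the RLCSA implementation: the names `psi_array` and `count_runs` are hypothetical, the suffix array is obtained by naive sorting, and the arrays are stored uncompressed.

```python
def psi_array(T):
    """Return [Psi(1), ..., Psi(n)] for a text T ending with a unique '$'."""
    n = len(T)
    A = sorted(range(1, n + 1), key=lambda i: T[i - 1:])   # 1-based values
    inv = [0] * (n + 1)                                     # inverse of A
    for rank, pos in enumerate(A, start=1):
        inv[pos] = rank
    # Psi(i) = A^{-1}[(A[i] mod n) + 1]
    return [inv[(a % n) + 1] for a in A]


def count_runs(psi):
    """Number of maximal intervals where Psi(i + 1) = Psi(i) + 1."""
    return 1 + sum(1 for i in range(1, len(psi)) if psi[i] != psi[i - 1] + 1)


psi = psi_array("banana$")
print(psi, count_runs(psi))
```

Run-length encoding the differences of consecutive values collapses each run into a single pair, so the space depends on the number of runs rather than on n.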

The number of runs may be much smaller than \(nH_k(T)\) (for example \(runs(T) = runs(TT)\),
whereas \(|TT|H_k(TT) \geq 2|T|H_k(T)\) as shown in Section 2.3). However, the
difference between the number of runs and the number of phrases in an LZ77 parsing
[ZL77] may be a multiplicative factor as high as \(\Omega(\sqrt{n})\). For these reasons, the
RLCSA seems to be an intermediate solution between LZ77 and empirical-entropy-based
indexes.

2.14.2 Indexes based on sparse suffix arrays

In this section we present two indexes [KU96b, KU96a] by Kärkkäinen and Ukkonen.
Although these are not self-indexes, they set the ground for several self-indexes
proposed later.

• First, they choose some indexing positions of the text. These can be evenly 
spaced points [KU96b] or the points defined by a Lempel-Ziv parsing [KU96a]. 

• The suffixes starting at those points are indexed in a suffix trie, and the reversed 
prefixes in another trie. 

• The index in principle only allows one to find occurrences crossing an indexing 
point. 

• To find a pattern P of length m, they partition it in all m + 1 combinations of
prefix and suffix.

• For each partition, they search for the suffix in the suffix trie and for the prefix
of the pattern in the reverse prefix trie.

• The previous searches define a 2-dimensional range in a grid that relates each
indexed text prefix (in lexicographic order) with the text suffix that follows (in
lexicographic order). That is, related prefixes and suffixes are consecutive in
the text.

• A data structure supporting 2-dimensional range queries [Cha88] finds all pairs
of related suffixes and prefixes, finding in this way the actual occurrences.

• Additionally, using a Lempel-Ziv parsing they are able to find all the occurrences
of the pattern. The occurrences are either found in the grid by the process
described above (primary occurrences), or by considering the copies detected
by the parsing (secondary occurrences), for which an additional method tracking
the copies finds the remaining occurrences.

Footnote (on the \(\Omega(\sqrt{n})\) factor of Section 2.14.1): Veli Mäkinen, personal communication.

All the following indexes can be thought of as heirs of this general idea, which was
improved by replacing or adding additional compact data structures to decrease the
space usage. In most cases, the parsing was restricted only to LZ78 (Section 2.14.3),
since it simplifies the index, and in others to text grammars (SLPs, Section 2.14.4).
In the following two subsections we list the results obtained in those cases. This
thesis can also be thought of as an heir of this fundamental scheme: for the first time
compact data structures supporting the LZ77 parsing have been developed in this
thesis, which show better performance on repetitive texts.

2.14.3 LZ78-based Self-Indexes 

In this section we present the space and running times of two indexes based on LZ78. 
Although they offer decent upper bounds and competitive performance on typical 
texts, experiments [SVMN08] have demonstrated that LZ78 is too weak to profit from 
highly repetitive texts. There are other such self-indexes [FM05], not implemented 
as far as we know. 

Arroyuelo et al.'s LZ-Index

Navarro's LZ-Index [Nav04] is the first self-index based on the LZ78 parsing using
\(O(nH_k(T))\) bits of space (it is also the first implemented in practice). It uses
\(4n' \log n' (1 + o(1))\) bits and takes \(O(m^3 \log \sigma + (m + occ) \log n')\) time to locate the occ
occurrences of a pattern of length m, where \(\sigma\) is the size of the alphabet, and \(n'\) is
the number of phrases of the parsing.

Arroyuelo et al. later improved the time and space of the index, achieving
\((2 + \epsilon) n' \log n' (1 + o(1))\) bits and \(O(m^2 + (m + occ) \log n')\) locate time [ANS10], or
\((3 + \epsilon) n' \log n' (1 + o(1))\) bits and \(O((m + occ) \log n')\) locate time [AN07].

Russo and Oliveira's ILZI 

Russo and Oliveira present a self-index based on the so-called maximal parsing, called 
ILZI [RO08]. 

Definition 2.38 ([RO08]). Given a suffix trie $\mathcal{T}$ (of a set of strings), the $\mathcal{T}$-maximal
parsing of string $T$ is the sequence of nodes $v_1, \ldots, v_f$ such that $T = v_1 \ldots v_f$ and, for
every $j$, $v_j$ is the largest prefix of $v_j \ldots v_f$ that is a node of $\mathcal{T}$.
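
Definition 2.38 can be made concrete with the following Python sketch. It is a naive illustration, not the data structure of [RO08]: the suffix trie is replaced by an explicit set holding every non-empty substring of the input strings (these are the labels of the trie nodes other than the root), and the function names are ours.

```python
def suffix_trie_nodes(strings):
    """Labels of all non-root nodes of the suffix trie of a set of strings,
    i.e. every non-empty substring of every string in the set."""
    nodes = set()
    for s in strings:
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                nodes.add(s[i:j])
    return nodes


def maximal_parsing(text, nodes):
    """Greedy parsing of text: each phrase is the longest prefix of the
    remaining text that is a node label."""
    phrases = []
    i = 0
    while i < len(text):
        j = len(text)
        while j > i and text[i:j] not in nodes:
            j -= 1
        if j == i:
            raise ValueError(f"character {text[i]!r} does not label any node")
        phrases.append(text[i:j])
        i = j
    return phrases


if __name__ == "__main__":
    nodes = suffix_trie_nodes({"ab", "ba"})
    print(maximal_parsing("abba", nodes))
```

The quadratic set construction is only meant for tiny examples; a real suffix trie or suffix tree would answer the longest-prefix queries by a single downward traversal.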

First, they compute the LZ78 parsing of $T^{rev}$, and then generate a suffix tree $\mathcal{T}_{78}$
over the set of the reverse phrases. Next they build the maximal parsing of $T$ using
$\mathcal{T}_{78}$. This parsing improves the compression of LZ78, as shown by the following
lemma.

Lemma 2.39 ([RO08]). If the number of phrases of the LZ78 parsing of $T$ is $n'$ then
the $\mathcal{T}_{78}$-maximal parsing of $T$ has at most $n'$ phrases.

Their index uses at most $5n' \log n'(1 + o(1))$ bits and takes $O((m + occ) \log n')$ time
to locate the $occ$ occurrences of a pattern of length $m$ ($n'$ is the number of blocks of
the maximal parsing).

2.14.4 Straight Line Programs (SLPs)

Claude and Navarro [CN09] proposed a self-index based on straight-line programs
(SLPs). SLPs are grammars in which the rules are either $X_i \to a$ with $a \in \Sigma$, or
$X_i \to X_l X_r$, for $l, r < i$. The LZ78 [ZL78] parsing may produce an output exponentially
larger than the smallest SLP. However, the LZ77 [ZL77] parsing outperforms the
smallest SLP [CLL+05]. On the other hand, producing the smallest SLP is an
NP-complete problem [Ryt03, CLL+05]. However, Rytter [Ryt03] has shown how to
generate in linear time a grammar using $O(\ell \log \ell)$ rules and height $O(\log \ell)$, where
$\ell$ is the size of the LZ77 parsing. Again, SLPs are intermediate between LZ77 and
other methods.

The index [CN09] uses $n' \log n + O(n' \log n')$ bits of space, where $n'$ is the number
of rules of the grammar. It solves $extract(l, r)$ in $O((r - l + h) \log n')$ time and $locate$
in $O((m(m + h) + h \cdot occ) \log n')$ time, where $h$ is the height of the derivation tree of
the grammar and $m$ the length of the pattern.
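
To make the grammar rules and the extract operation concrete, the following Python sketch stores an SLP as a plain list of rules and extracts a substring by descending the derivation tree. It is an uncompressed illustration under our own naming (class `SLP`, method `extract`, 0-based inclusive positions), not the compact representation of [CN09]: it visits a number of nodes proportional to the extracted length plus the height, but does not reproduce the space bound or the log n' factor of the index.

```python
class SLP:
    """Straight-line program: rules[i] is either a single character
    (X_i -> a) or a pair (l, r) with l, r < i (X_i -> X_l X_r)."""

    def __init__(self, rules):
        self.rules = rules
        self.length = []            # length of the expansion of each X_i
        for i, rule in enumerate(rules):
            if isinstance(rule, str):
                self.length.append(1)
            else:
                l, r = rule
                if not (l < i and r < i):
                    raise ValueError("rules may only refer to earlier symbols")
                self.length.append(self.length[l] + self.length[r])

    def extract(self, symbol, lo, hi):
        """Characters lo..hi (0-based, inclusive) of the expansion of X_symbol."""
        out = []
        self._extract(symbol, lo, hi, out)
        return "".join(out)

    def _extract(self, symbol, lo, hi, out):
        rule = self.rules[symbol]
        if isinstance(rule, str):
            out.append(rule)
            return
        l, r = rule
        left_len = self.length[l]
        if lo < left_len:           # the range touches the left child
            self._extract(l, lo, min(hi, left_len - 1), out)
        if hi >= left_len:          # the range touches the right child
            self._extract(r, max(lo, left_len) - left_len, hi - left_len, out)


if __name__ == "__main__":
    # X0 -> b, X1 -> a, X2 -> X1 X0, X3 -> X2 X1, X4 -> X3 X2, X5 -> X4 X3
    slp = SLP(["b", "a", (1, 0), (2, 1), (3, 2), (4, 3)])
    print(slp.extract(5, 0, slp.length[5] - 1))
    print(slp.extract(5, 2, 5))
```

Six rules suffice here to generate a string of length 8; in general the expansion can be exponentially longer than the grammar.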

Claude et al. [CFMPN10] evaluated a practical implementation using the grammar
produced by Re-Pair [LM00]. The results are competitive with the RLCSA only for
extremely repetitive texts and short patterns.

Chapter 3

A Repetitive Corpus Testbed 



In this chapter we present a corpus of repetitive texts. These texts are categorized 
according to the source they come from into the following: Artificial Texts, Pseudo- 
Real Texts and Real Texts. The main goal of this collection is to serve as a standard 
testbed for benchmarking algorithms oriented to repetitive texts. The corpus can be 
downloaded from http://pizzachili.dcc.uchile.cl/repcorpus.html.



3.1 Artificial Texts 

This subset is composed of highly repetitive texts that do not come from any real-life 
source, but are artificially generated through some mathematical definition and have 
interesting combinatorial properties. 

3.1.1 Fibonacci Sequence 

This sequence is defined by the recurrence

$$F_1 = 0, \qquad F_2 = 1, \qquad F_n = F_{n-1} F_{n-2} \tag{3.1}$$

The length of the string $F_n$ is the corresponding Fibonacci number, and the sequence is a Sturmian
word [Lot02], which means it has $i + 1$ different substrings (factors) of length $i$.
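
The recurrence and the Sturmian property can be checked on a small instance with the following Python sketch (the function names are ours, and the brute-force factor count is only suitable for short strings).

```python
def fibonacci_string(n):
    """F_1 = '0', F_2 = '1', F_n = F_{n-1} F_{n-2} (n >= 1)."""
    prev, cur = "0", "1"
    if n == 1:
        return prev
    for _ in range(n - 2):
        prev, cur = cur, cur + prev
    return cur


def distinct_factors(text, k):
    """Number of different substrings of length k occurring in text."""
    return len({text[i:i + k] for i in range(len(text) - k + 1)})


if __name__ == "__main__":
    f = fibonacci_string(20)
    print(len(f))
    print([distinct_factors(f, k) for k in range(1, 9)])
```

For lengths 1 to 8 the counts are 2, 3, ..., 9, which is the pattern reported in parentheses for this file in the empirical entropy statistics.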

3.1.2 Thue-Morse Sequence ($T_n$)

This sequence [AS99] is defined by the recurrence

$$T_1 = 0, \qquad T_n = T_{n-1} \overline{T_{n-1}} \tag{3.2}$$

where $\overline{X}$ is the bitwise negation of $X$ (i.e., all 0s get converted to 1 and all 1s to
0). Because of the construction scheme of this sequence, there are many substrings
of the form $XX$ for some string $X$. However, there are no overlapping squares, i.e.,
substrings of the form $0X0X0$ or $1X1X1$. Furthermore, this sequence is strongly
cube-free, i.e., there are no substrings of the form $XXx$, where $x$ is the first character
of the string $X$. Another interesting property of this string is that it is recurrent.
That is, given any finite substring $w$ of length $n$, there is some length $n_w$ (often much
longer than $n$) such that $w$ is contained in every substring of length $n_w$. The length
of these strings is $|T_n| = 2^{n-1}$.
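
The construction and the absence of overlapping squares can be verified on a short prefix with this Python sketch. The function names are ours, and the check is a quadratic brute force: a factor of the form cXcXc, with c a single character, is a factor of length 2p + 1 with period p = |cX|.

```python
def thue_morse(n):
    """T_1 = '0', T_n = T_{n-1} followed by the bitwise negation of T_{n-1}."""
    flip = str.maketrans("01", "10")
    t = "0"
    for _ in range(n - 1):
        t += t.translate(flip)
    return t


def has_overlap(text):
    """True if text contains a factor cXcXc, i.e. a factor of length 2p + 1
    with period p for some p >= 1."""
    n = len(text)
    for p in range(1, n // 2 + 1):
        run = 0   # consecutive positions i with text[i] == text[i + p]
        for i in range(n - p):
            run = run + 1 if text[i] == text[i + p] else 0
            if run >= p + 1:
                return True
    return False


if __name__ == "__main__":
    t = thue_morse(10)
    print(len(t), has_overlap(t))
    print(has_overlap("01010"))
```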



3.1.3 Run-Rich String Sequence ($R_n$)

A measure of string complexity, related to the regularities of the text and strongly 
related to the LZ77 parsing [KK99] , is the number of runs. 

Definition 3.1. A period of string $T[1,n]$ is a positive integer $p$ such that
$\forall\, 1 \le i \le n - p$, $T[i] = T[i + p]$. A string is said to be periodic if its minimum period $p$ is
such that $p \le n/2$.

Definition 3.2 ([Mai89]). The substring $T[i,j]$ is a run in a string $T$ iff $T[i,j]$ is
periodic and $T[i,j]$ is not extendable to the right ($j = n$ or $T[j + 1] \ne T[j - p + 1]$)
or left ($i = 1$ or $T[i - 1] \ne T[i + p - 1]$).

The higher the number of runs in a string, the more regularities it has. 
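
Definitions 3.1 and 3.2 translate into the following brute-force Python sketch, which lists the runs of a short string. It is an illustration with our own function name and 0-based inclusive positions, and it assumes the reading of Definition 3.1 in which a string of length n is periodic when its minimum period is at most n/2. For every candidate period p it collects the maximal segments with that period and length at least 2p; a segment found for a non-minimal period coincides with the one found for its minimum period, so a set removes the duplicates.

```python
def runs(text):
    """Sorted list of runs (start, end), 0-based and inclusive."""
    n = len(text)
    found = set()
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if text[i] != text[i + p]:
                i += 1
                continue
            start = i
            while i + p < n and text[i] == text[i + p]:
                i += 1
            end = i + p - 1                 # last position of the segment
            if end - start + 1 >= 2 * p:    # long enough to be periodic
                found.add((start, end))
    return sorted(found)


if __name__ == "__main__":
    print(runs("mississippi"))
    print(len(runs("aaaa")))
```

The quadratic running time is acceptable only for small examples; linear-time algorithms exist but are considerably more involved.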

It has been shown that the maximum number of runs in a string is greater than
$0.944n$ [MKI+08] and lower than $1.029n$ [CIT08]. Franek et al. [FSS03] show a
constructive and simple way to obtain strings with many runs; the $n$-th of those
strings is denoted $R_n$. The ratio of the runs of their strings to the length approaches
$3/(1 + \sqrt{5}) = 0.92705\ldots$.

3.2 Pseudo-Real Texts

Here we present a set of texts that were generated by artificially adding repetitiveness 
to real texts, thus we call them pseudo-real texts. 

To generate the texts, we took a prefix of 1MiB of each text of the Pizza&Chili Corpus^1,
mutated it, and concatenated all the resulting texts in the order they were generated.
Our mutations take a random character position and change it to a random character
different from the original one.

We used two different schemes for the mutations. The first one, denoted by a superscript 1,
generates different mutations of the first text. The second, denoted by a superscript 2,
mutates the last text generated. The second scheme resembles the changes obtained
through time in a software project or the versions of a document, while the first
scheme produces changes analogous to the ones found in a collection of related DNA
sequences.

The mutation rate, i.e., percentage of mutated characters, was set to 0.1%, 0.01% 
and 0.001%. 
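
The two mutation schemes can be sketched in Python as follows. This is an illustration under stated assumptions rather than the generator actually used for the corpus: the function names, the seed handling, the choice of the alphabet as the set of characters of the base text, and the decision to keep the unmutated base as the first copy are all ours.

```python
import random


def mutate(text, rate, alphabet, rng):
    """Change round(rate * len(text)) distinct random positions, each to a
    random character different from the one currently there."""
    chars = list(text)
    for pos in rng.sample(range(len(chars)), round(len(chars) * rate)):
        chars[pos] = rng.choice([c for c in alphabet if c != chars[pos]])
    return "".join(chars)


def pseudo_real(base, copies, rate, scheme, seed=0):
    """Concatenate `copies` texts: scheme 1 always mutates the first text,
    scheme 2 mutates the last text generated."""
    rng = random.Random(seed)
    alphabet = sorted(set(base))      # assumed alphabet; needs >= 2 symbols
    texts = [base]
    for _ in range(copies - 1):
        source = base if scheme == 1 else texts[-1]
        texts.append(mutate(source, rate, alphabet, rng))
    return "".join(texts)


if __name__ == "__main__":
    base = "ACGT" * 250
    print(len(pseudo_real(base, copies=5, rate=0.01, scheme=1)))
    print(len(pseudo_real(base, copies=5, rate=0.01, scheme=2)))
```

Under scheme 1 every copy stays at the same distance from the base text, while under scheme 2 the differences accumulate from one copy to the next.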

The base texts (all from the Pizza&Chili corpus) we mutated were the following:

• Sources: This file is formed by C/Java source code obtained by concatenating all
the .c, .h, .C and .java files of the linux-2.6.11.6 and gcc-4.0.0 distributions.

• Pitches: This file is a sequence of MIDI pitch values (bytes in 0-127, plus a few
extra special values) obtained from a myriad of MIDI files freely available on the
Internet.

• Proteins: This file is a sequence of newline-separated protein sequences obtained
from the Swissprot database.

• DNA: This file is a sequence of newline-separated gene DNA sequences obtained
from files 01hgp10 to 21hgp10, plus 0xhgp10 and 0yhgp10, from the Gutenberg
Project.

• English: This file is the concatenation of English text files selected from the etext02
to etext05 collections of the Gutenberg Project.

• XML: This file is an XML document that provides bibliographic information on major
computer science journals and proceedings; it was obtained from
http://dblp.uni-trier.de.

^1 http://pizzachili.dcc.uchile.cl

3.3 Real Texts

This subset is composed of texts coming from real repetitive sources. These sources 
are DNA, Wikipedia Articles, Source Code, and Documents. 

For the case of DNA we concatenated the texts in random order. For the others, 
we concatenated the texts according to the date they were created, from oldest to 
newest. 

3.3.1 DNA 

Our DNA texts come from different sources. 

• The Saccharomyces Genome Resequencing Project^2 provides two text collections:
para, which contains 36 sequences of Saccharomyces Paradoxus, and cere,
which contains 37 sequences of Saccharomyces Cerevisiae.

• From the National Center for Biotechnology Information (NCBI)^3 we collected
some DNA sequences of the same bacteria. The species we collected are
Escherichia Coli (23), Salmonella Enterica (15), Staphylococcus Aureus (14),
Streptococcus Pyogenes (13), Streptococcus Pneumoniae (11) and Clostridium
Botulium (10). We wrote in parentheses the total number of sequences of each
collection. We chose these species as they were the only ones with 10 or more
different sequences.

• A collection composed of 78,041 sequences of Haemophilus Influenzae^4, also
coming from the NCBI.

Remark 3.3. Although there are four bases {A, C, G, T}, DNA sequences may have
alphabets of size up to $16 = 2^4$ because some characters denote an unknown choice
among the four bases. The most common character used is N, which denotes a totally
unknown symbol.

^2 http://www.sanger.ac.uk/Teams/Team71/durbin/sgrp
^3 http://www.ncbi.nlm.nih.gov
^4 ftp://ftp.ncbi.nih.gov/genomes/INFLUENZA/influenza.fna.gz

3.3.2 Wikipedia Articles

We downloaded all versions of three Wikipedia articles, Albert Einstein, Alan Turing 
and Nobel Prize. We downloaded them in English (denoted en) and German (denoted 
de). We chose these languages as they are among the most widely used on the Internet
and their alphabet may be represented using standard 1-byte encodings. The versions 
for all documents are up to January 12, 2010, except for the English article of Albert 
Einstein, which was downloaded only up to November 10, 2006 because of the massive 
number of versions it has. 

3.3.3 Source Code 

We collected all versions 5.x of the Coreutils^5 package and removed all binary files,
making a total of 9 versions. We also collected all 1.0.x and 1.1.x versions of the
Linux Kernel^6, making a total of 36 versions.

3.3.4 Documents

We took all pdf files of CIA World Leaders^7 from January 2003 to December 2009,
and converted them to text (using the pdftotext software).

^5 ftp://mirrors.kernel.org/gnu/coreutils
^6 ftp://ftp.kernel.org/pub/linux/kernel
^7 https://www.cia.gov/library/publications/world-leaders-1

3.4 Statistics

To understand the characteristics of the texts present in the Repetitive Corpus, we
provide below some statistics about them. The statistics presented are the following:

• Alphabet Size: We give the alphabet size and the inverse probability of matching
(IPM), which is the inverse of the probability that two characters chosen at
random match. IPM is a measure of the effective alphabet size. On a uniformly
distributed text, it is precisely the alphabet size.

• Compression Ratio: Since we are dealing with compressed indexes, it is useful
to have an idea of the compressibility of the texts using general-purpose
compressors. The following compressors are used: gzip^8 gives an idea of
compressibility via dictionaries (an LZ77 parsing with limited window size); bzip2^9
gives an idea of block-sorting compressibility (using the BWT transform,
Section 2.12); ppmdi^10 gives an idea of partial-match-based compressors (related
to the $k$-th order entropy, Section 2.3); p7zip^11 gives an idea of LZ77 compression
with virtually unlimited window; and Re-Pair^12 [LM00] gives an idea of
grammar-based compression. All compressors were run with the highest
compression options.

• Empirical Entropy: Here we give the empirical entropy $H_k$ of the text with
$k$ ranging from 0 to 8, measured as compression ratio. We also show, in
parentheses, the complexity function of $T$ [Lot02] (or the number of contexts), which
counts how many different substrings of a given size $T$ has. This is exactly
our $C(T, k)$ of Lemma 2.12. This measure has the following properties:

$$C(T, 1) = \sigma$$
$$C(T, n + m) \le C(T, n)\, C(T, m)$$

The lower this measure, the more repetitive the text is. For example, if $C(T, n) = 1$
$\forall n$, then $T = c^{|T|}$ for some character $c$. When $C(T, n) = n + 1$ the sequence
is said to be Sturmian (the Fibonacci sequence is an example of a Sturmian
string).
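
The statistics listed above can be computed on small inputs with the following Python sketch. It is a direct, unoptimized illustration with our own function names; the entropy follows the usual definition of the k-th order empirical entropy (the zero-order entropies of the symbols following each length-k context, weighted by their number and divided by the text length), which we assume agrees with the one of Section 2.3. The last helper converts bits per symbol into a percentage, assuming one byte per character.

```python
from collections import Counter
from math import log2


def ipm(text):
    """Inverse probability of matching: 1 / sum of squared symbol frequencies."""
    n = len(text)
    return 1.0 / sum((c / n) ** 2 for c in Counter(text).values())


def complexity(text, k):
    """C(T, k): number of different substrings of length k."""
    return len({text[i:i + k] for i in range(len(text) - k + 1)})


def empirical_entropy(text, k):
    """k-th order empirical entropy, in bits per symbol."""
    n = len(text)
    followers = {}                      # context -> counts of next symbol
    for i in range(k, n):
        followers.setdefault(text[i - k:i], Counter())[text[i]] += 1
    bits = 0.0
    for counts in followers.values():
        total = sum(counts.values())
        bits -= sum(c * log2(c / total) for c in counts.values())
    return bits / n


def as_ratio(bits_per_symbol):
    """Entropy expressed as a percentage of one byte per character."""
    return 100.0 * bits_per_symbol / 8.0


if __name__ == "__main__":
    t = "ACGT" * 10
    print(ipm(t), complexity(t, 2))
    print(as_ratio(empirical_entropy(t, 0)), as_ratio(empirical_entropy(t, 1)))
```

On this toy text the symbols are uniformly distributed, so the IPM equals the alphabet size and the zero-order entropy corresponds to 25%, while a single symbol of context already predicts the next one exactly.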

Remark 3.4. The compression ratios are given as the percentage of the compressed 
file size over the uncompressed file size, assuming the original file uses one byte per 
character. This means that 25% compression can be achieved over a DNA sequence 
having an alphabet {A , C , G , T} by simply using 2 bits per symbol. As seen from the 
real-life examples given, these four symbols are usually predominant, so it is not hard 
to get very close to 25% on general DNA sequences as well. 
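
The 25% figure of Remark 3.4 can be reproduced with a minimal Python sketch that packs four bases into each byte. The function names and the bit layout (least significant bits first) are our own choices, and the sketch handles only the four plain bases.

```python
def pack_dna(seq):
    """Pack a string over {A, C, G, T} using 2 bits per symbol."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= code[base] << (2 * (i % 4))
    return bytes(out)


def unpack_dna(data, length):
    """Inverse of pack_dna; the original length must be supplied."""
    letters = "ACGT"
    return "".join(letters[(data[i // 4] >> (2 * (i % 4))) & 3]
                   for i in range(length))


def compression_ratio(original_size, compressed_size):
    """Compressed size as a percentage of the original size."""
    return 100.0 * compressed_size / original_size


if __name__ == "__main__":
    seq = "ACGT" * 1000
    packed = pack_dna(seq)
    print(compression_ratio(len(seq), len(packed)))
```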

3.4.1 Artificial Texts 

Tables 3.1-3.3 give the statistics of artificial texts. 



^8 http://www.gzip.org
^9 http://www.bzip.org
^10 http://pizzachili.dcc.uchile.cl/utils/ppmdi.tar.gz
^11 http://www.7-zip.org
^12 http://www.cbrc.jp/~rwan/en/restore.html

File   Size     Σ   IPM
F41    256MiB   2   1.894
T29    256MiB   2   2.000
R13    207MiB   2   2.000

Table 3.1: Alphabet statistics for Artificial Collection



File   p7zip      bzip2      gzip       ppmdi      Re-Pair
F41    0.17624%   0.00572%   0.46875%   1.87500%   0.00002%
T29    0.35896%   0.01259%   0.54688%   2.18750%   0.00004%
R13    0.17172%   0.01227%   0.53140%   2.12560%   0.00009%

Table 3.2: Compression statistics for Artificial Collection



File   H0       H1       H2       H3       H4       H5       H6       H7       H8
F41    11.99%   7.41%    4.58%    4.58%    2.83%    2.83%    2.83%    1.75%    1.75%
       (1)      (2)      (3)      (4)      (5)      (6)      (7)      (8)      (9)
T29    12.50%   11.48%   8.34%    8.34%    4.16%    4.16%    4.16%    2.09%    2.09%
       (1)      (2)      (4)      (6)      (10)     (12)     (16)     (20)     (22)
R13    12.50%   9.85%    8.51%    6.55%    2.56%    2.33%    2.33%    2.33%    2.33%
       (1)      (2)      (4)      (6)      (8)      (10)     (12)     (14)     (16)

Table 3.3: Empirical entropy statistics for Artificial Collection

3.4.2 Pseudo-Real Texts

Tables 3.4-3.9 give the statistics of pseudo-real texts. 



File                Size      Σ     IPM
Xml 0.001%^1        100MiB    89    27.84
Xml 0.01%^1         100MiB    89    27.84
Xml 0.1%^1          100MiB    89    27.84
DNA 0.001%^1        100MiB    5     3.98
DNA 0.01%^1         100MiB    5     3.98
DNA 0.1%^1          100MiB    5     3.98
English 0.001%^1    100MiB    106   15.65
English 0.01%^1     100MiB    106   15.65
English 0.1%^1      100MiB    106   15.65
Pitches 0.001%^1    100MiB    73    33.07
Pitches 0.01%^1     100MiB    73    33.07
Pitches 0.1%^1      100MiB    73    33.07
Proteins 0.001%^1   100MiB    21    16.90
Proteins 0.01%^1    100MiB    21    16.90
Proteins 0.1%^1     100MiB    21    16.90
Sources 0.001%^1    100MiB    98    28.86
Sources 0.01%^1     100MiB    98    28.86
Sources 0.1%^1      100MiB    98    28.86

Table 3.4: Alphabet statistics for Pseudo-Real Collection (Scheme 1)



File                Size      Σ     IPM
Xml 0.001%^2        100MiB    89    27.84
Xml 0.01%^2         100MiB    89    27.84
Xml 0.1%^2          100MiB    89    27.86
DNA 0.001%^2        100MiB    5     3.98
DNA 0.01%^2         100MiB    5     3.98
DNA 0.1%^2          100MiB    5     3.98
English 0.001%^2    100MiB    106   15.65
English 0.01%^2     100MiB    106   15.66
English 0.1%^2      100MiB    106   15.74
Pitches 0.001%^2    100MiB    73    33.07
Pitches 0.01%^2     100MiB    73    33.07
Pitches 0.1%^2      100MiB    73    33.10
Proteins 0.001%^2   100MiB    21    16.90
Proteins 0.01%^2    100MiB    21    16.90
Proteins 0.1%^2     100MiB    21    16.92
Sources 0.001%^2    100MiB    98    28.86
Sources 0.01%^2     100MiB    98    28.86
Sources 0.1%^2      100MiB    98    28.92

Table 3.5: Alphabet statistics for Pseudo-Real Collection (Scheme 2)

File                p7zip        bzip2        gzip         ppmdi        Re-Pair
Xml 0.001%^1        [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
Xml 0.01%^1         [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
Xml 0.1%^1          [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
DNA 0.001%^1        0.27%        27.00%       28.00%       11.00%       0.34%
DNA 0.01%^1         0.29%        27.00%       28.00%       11.00%       0.58%
DNA 0.1%^1          0.51%        27.00%       28.00%       12.00%       2.50%
English 0.001%^1    0.31%        28.00%       37.00%       22.00%       0.39%
English 0.01%^1     0.35%        28.00%       37.00%       22.00%       0.65%
English 0.1%^1      0.59%        28.00%       37.00%       22.00%       2.70%
Pitches 0.001%^1    0.47%        54.00%       52.00%       47.00%       0.69%
Pitches 0.01%^1     0.50%        54.00%       52.00%       47.00%       0.95%
Pitches 0.1%^1      0.75%        54.00%       52.00%       48.00%       3.20%
Proteins 0.001%^1   0.32%        41.00%       39.00%       31.00%       0.42%
Proteins 0.01%^1    0.35%        41.00%       39.00%       31.00%       0.68%
Proteins 0.1%^1     0.59%        41.00%       39.00%       32.00%       2.70%
Sources 0.001%^1    0.20%        19.00%       25.00%       12.00%       0.28%
Sources 0.01%^1     0.23%        19.00%       25.00%       12.00%       0.56%
Sources 0.1%^1      0.50%        20.00%       25.00%       13.00%       2.60%

Table 3.6: Compression statistics for Pseudo-Real Collection (Scheme 1)



File                p7zip     bzip2     gzip      ppmdi     Re-Pair
Xml 0.001%^2        0.15%     12.00%    18.00%    3.50%     0.18%
Xml 0.01%^2         0.18%     14.00%    19.00%    4.40%     0.39%
Xml 0.1%^2          0.39%     25.00%    29.00%    17.00%    2.20%
DNA 0.001%^2        0.26%     27.00%    28.00%    11.00%    0.33%
DNA 0.01%^2         0.29%     27.00%    28.00%    11.00%    0.52%
DNA 0.1%^2          0.46%     27.00%    28.00%    13.00%    2.20%
English 0.001%^2    0.31%     28.00%    37.00%    22.00%    0.38%
English 0.01%^2     0.34%     29.00%    37.00%    23.00%    0.59%
English 0.1%^2      0.55%     38.00%    43.00%    31.00%    2.50%
Pitches 0.001%^2    0.46%     54.00%    52.00%    47.00%    0.68%
Pitches 0.01%^2     0.49%     54.00%    53.00%    48.00%    0.89%
Pitches 0.1%^2      0.71%     59.00%    57.00%    52.00%    2.80%
Proteins 0.001%^2   0.31%     41.00%    39.00%    32.00%    0.41%
Proteins 0.01%^2    0.34%     42.00%    40.00%    33.00%    0.62%
Proteins 0.1%^2     0.54%     47.00%    46.00%    40.00%    2.50%
Sources 0.001%^2    0.20%     20.00%    25.00%    13.00%    0.27%
Sources 0.01%^2     0.23%     21.00%    26.00%    14.00%    0.49%
Sources 0.1%^2      0.44%     34.00%    35.00%    26.00%    2.50%

Table 3.7: Compression statistics for Pseudo-Real Collection (Scheme 2)

File                H0         H1         H2         H3         H4         H5         H6         H7         H8
Xml 0.001%^1        65.25%     38.63%     21.00%     12.50%     8.13%      6.00%      5.25%      4.75%      4.13%
                    (1)        (89)       (3325)     (20560)    (56120)    (98084)    (134897)   (168846)   (200451)
Xml 0.01%^1         65.25%     38.63%     21.00%     12.50%     8.13%      6.00%      5.25%      4.75%      4.13%
                    (1)        (89)       (4135)     (30975)    (79379)    (131811)   (177924)   (220923)   (261651)
Xml 0.1%^1          65.25%     38.75%     21.25%     12.75%     8.25%      6.13%      5.38%      4.88%      4.25%
                    (1)        (89)       (5251)     (67479)    (196554)   (326296)   (440199)   (550570)   (661284)
DNA 0.001%^1        25.00%     24.25%     24.13%     24.00%     24.00%     23.75%     23.50%     22.88%     21.25%
                    (1)        (5)        (18)       (67)       (260)      (1029)     (4102)     (16349)    (62437)
DNA 0.01%^1         25.00%     24.25%     24.13%     24.00%     24.00%     23.75%     23.50%     22.88%     21.25%
                    (1)        (5)        (18)       (67)       (260)      (1029)     (4102)     (16368)    (63204)
DNA 0.1%^1          25.00%     24.25%     24.13%     24.00%     24.00%     23.75%     23.50%     22.88%     21.38%
                    (1)        (5)        (19)       (70)       (264)      (1034)     (4109)     (16399)    (65168)
English 0.001%^1    57.25%     45.13%     34.75%     25.88%     19.88%     15.88%     12.50%     9.63%      7.25%
                    (1)        (106)      (2659)     (18352)    (63299)    (145194)   (256838)   (379514)   (501400)
English 0.01%^1     57.25%     45.13%     34.75%     25.88%     19.88%     15.88%     12.50%     9.63%      7.25%
                    (1)        (106)      (3243)     (24063)    (82896)    (180401)   (305292)   (439387)   (572056)
English 0.1%^1      57.25%     45.25%     34.88%     26.13%     20.13%     16.00%     12.50%     9.75%      7.25%
                    (1)        (106)      (4491)     (46116)    (190765)   (439130)   (715127)   (983435)   (1237512)
Pitches 0.001%^1    66.13%     61.00%     53.50%     37.13%     16.38%     6.25%      2.88%      1.38%      0.75%
                    (1)        (73)       (3549)     (73664)    (376958)   (642406)   (767028)   (833456)   (871970)
Pitches 0.01%^1     66.13%     61.00%     53.50%     37.25%     16.38%     6.25%      2.88%      1.38%      0.75%
                    (1)        (73)       (3581)     (76900)    (399435)   (684445)   (821533)   (898126)   (946219)
Pitches 0.1%^1      66.13%     61.13%     53.63%     37.38%     16.63%     6.38%      2.88%      1.50%      0.88%
                    (1)        (73)       (3733)     (95838)    (598394)   (1096014)  (1363610)  (1543086)  (1687166)
Proteins 0.001%^1   52.25%     52.13%     51.63%     47.50%     25.13%     4.63%      0.75%      0.25%      0.25%
                    (1)        (21)       (422)      (8045)     (128975)   (463357)   (572530)   (589356)   (595906)
Proteins 0.01%^1    52.25%     52.13%     51.63%     47.50%     25.13%     4.63%      0.75%      0.25%      0.25%
                    (1)        (21)       (422)      (8045)     (131064)   (494845)   (626269)   (654067)   (670075)
Proteins 0.1%^1     52.25%     52.13%     51.63%     47.50%     25.50%     4.88%      0.88%      0.38%      0.38%
                    (1)        (21)       (425)      (8076)     (143879)   (768510)   (1150595)  (1293347)  (1403589)
Sources 0.001%^1    68.75%     46.88%     30.00%     19.63%     14.38%     11.00%     8.38%      6.88%      5.75%
                    (1)        (98)       (4557)     (29667)    (75316)    (130527)   (194105)   (259413)   (320468)
Sources 0.01%^1     68.75%     46.88%     30.00%     19.63%     14.38%     11.00%     8.50%      6.88%      5.75%
                    (1)        (98)       (5621)     (42303)    (102977)   (170525)   (244755)   (320237)   (391260)
Sources 0.1%^1      68.75%     47.00%     30.25%     19.88%     14.63%     11.13%     8.50%      7.00%      5.88%
                    (1)        (98)       (7359)     (104679)   (299799)   (498046)   (687941)   (872189)   (1049051)

Table 3.8: Empirical entropy statistics for Pseudo-Real Collection (Scheme 1)

File                H0         H1         H2         H3         H4         H5         H6         H7         H8
Xml 0.001%^2        65.25%     38.63%     21.13%     12.63%     8.13%      6.00%      5.25%      4.75%      4.13%
                    (1)        (89)       (3325)     (20560)    (56120)    (98084)    (134897)   (168846)   (200451)
Xml 0.01%^2         65.25%     39.38%     22.00%     13.25%     8.63%      6.50%      5.63%      5.13%      4.50%
                    (1)        (89)       (4135)     (31042)    (79630)    (132163)   (178388)   (221499)   (262329)
Xml 0.1%^2          65.25%     44.00%     28.75%     18.50%     12.25%     9.25%      8.00%      7.13%      6.25%
                    (1)        (89)       (5255)     (72227)    (226418)   (378994)   (513539)   (645141)   (777226)
DNA 0.001%^2        25.00%     24.25%     24.13%     24.00%     24.00%     23.75%     23.50%     22.88%     21.25%
                    (1)        (5)        (18)       (67)       (260)      (1029)     (4102)     (16349)    (62436)
DNA 0.01%^2         25.00%     24.25%     24.13%     24.13%     24.00%     23.88%     23.50%     23.00%     21.38%
                    (1)        (5)        (18)       (67)       (260)      (1029)     (4102)     (16369)    (63242)
DNA 0.1%^2          25.00%     24.50%     24.38%     24.25%     24.25%     24.13%     23.88%     23.50%     22.38%
                    (1)        (5)        (19)       (70)       (264)      (1034)     (4109)     (16400)    (65387)
English 0.001%^2    57.25%     45.13%     34.75%     26.00%     20.00%     15.88%     12.50%     9.63%      7.13%
                    (1)        (106)      (2659)     (18353)    (63300)    (145195)   (256838)   (379514)   (501400)
English 0.01%^2     57.25%     45.50%     35.38%     26.50%     20.25%     15.88%     12.38%     9.50%      7.13%
                    (1)        (106)      (3243)     (24079)    (83037)    (180592)   (305458)   (439539)   (572186)
English 0.1%^2      57.38%     47.75%     39.50%     31.13%     23.00%     16.63%     12.13%     8.88%      6.38%
                    (1)        (106)      (4482)     (47357)    (202366)   (466838)   (749065)   (1015587)  (1265447)
Pitches 0.001%^2    66.13%     61.13%     53.63%     37.25%     16.38%     6.25%      2.88%      1.38%      0.75%
                    (1)        (73)       (3549)     (73664)    (376958)   (642406)   (767028)   (833456)   (871970)
Pitches 0.01%^2     66.13%     61.13%     53.88%     37.50%     16.50%     6.38%      2.88%      1.38%      0.88%
                    (1)        (73)       (3581)     (76917)    (399546)   (684518)   (821589)   (898152)   (946228)
Pitches 0.1%^2      66.13%     62.00%     55.88%     40.25%     17.38%     6.50%      3.13%      1.88%      1.38%
                    (1)        (73)       (3742)     (96359)    (606175)   (1103560)  (1367417)  (1545154)  (1688526)
Proteins 0.001%^2   52.25%     52.13%     51.63%     47.50%     25.25%     4.63%      0.75%      0.25%      0.25%
                    (1)        (21)       (422)      (8045)     (128975)   (463357)   (572529)   (589356)   (595906)
Proteins 0.01%^2    52.25%     52.13%     51.63%     47.63%     25.75%     5.00%      0.88%      0.50%      0.38%
                    (1)        (21)       (422)      (8045)     (131079)   (494846)   (626306)   (654107)   (670114)
Proteins 0.1%^2     52.25%     52.13%     51.75%     48.75%     30.13%     7.63%      2.13%      1.50%      1.38%
                    (1)        (21)       (426)      (8072)     (143924)   (771311)   (1154106)  (1297080)  (1407901)
Sources 0.001%^2    68.75%     47.00%     30.00%     19.75%     14.38%     11.00%     8.50%      6.88%      5.75%
                    (1)        (98)       (4557)     (29667)    (75316)    (130527)   (194105)   (259413)   (320468)
Sources 0.01%^2     68.75%     47.50%     30.75%     20.13%     14.63%     11.13%     8.63%      7.00%      5.88%
                    (1)        (98)       (5615)     (42337)    (103082)   (170646)   (244874)   (320346)   (391369)
Sources 0.1%^2      68.75%     51.25%     36.63%     24.38%     16.75%     12.13%     9.13%      7.25%      6.00%
                    (1)        (98)       (7372)     (108997)   (319310)   (525914)   (718657)   (904022)   (1080824)

Table 3.9: Empirical entropy statistics for Pseudo-Real Collection (Scheme 2)

3.4.3 Real Texts

Tables 3.10-3.12 give the statistics of real texts. 



File                        Size      Σ      IPM
Cere                        440MiB    5      4.301
Para                        410MiB    5      4.096
Clostridium Botulium        34MiB     4      3.356
Escherichia Coli            108MiB    15     4.000
Salmonella Enterica         66MiB     9      3.993
Staphylococcus Aureus       38MiB     5      3.579
Streptococcus Pneumoniae    23MiB     8      3.836
Streptococcus Pyogenes      24MiB     10     3.800
Influenza                   148MiB    15     3.845
Coreutils                   196MiB    236    19.553
Kernel                      247MiB    160    23.078
Einstein (en)               446MiB    139    19.501
Einstein (de)               89MiB     117    19.264
Nobel (en)                  85MiB     126    20.070
Nobel (de)                  31MiB     118    17.786
Turing (en)                 7.7MiB    103    21.096
Turing (de)                 85MiB     100    19.719
World Leaders               45MiB     89     3.855

Table 3.10: Alphabet statistics for Real Collection



File                        p7zip     bzip2     gzip      ppmdi     Re-Pair
Cere                        1.14%     2.50%     26.36%    24.09%    1.86%
Para                        1.46%     26.34%    27.07%    24.88%    2.80%
Clostridium Botulium        8.53%     25.88%    26.47%    24.12%    20.00%
Escherichia Coli            4.72%     26.85%    28.70%    25.93%    9.63%
Salmonella Enterica         5.61%     27.27%    28.79%    25.76%    12.42%
Staphylococcus Aureus       2.89%     26.32%    28.95%    25.00%    5.26%
Streptococcus Pneumoniae    4.78%     26.52%    27.39%    24.78%    9.57%
Streptococcus Pyogenes      5.00%     26.25%    27.08%    25.00%    9.58%
Influenza                   1.35%     6.62%     7.43%     3.78%     3.31%
coreutils                   1.94%     16.33%    24.49%    12.76%    2.55%
kernel                      0.81%     21.86%    27.13%    18.62%    1.13%
einstein.en                 0.07%     5.38%     35.20%    1.61%     0.10%
einstein.de                 0.11%     4.38%     31.46%    1.35%     0.16%
nobel.en                    0.13%     2.94%     18.82%    1.76%     0.20%
nobel.de                    0.18%     3.55%     27.74%    1.68%     0.30%
turing.en                   1.09%     36.36%    285.71%   15.58%    1.71%
turing.de                   0.03%     0.18%     0.10%     0.11%     0.05%
world leaders               1.29%     7.11%     17.78%    3.56%     1.78%

Table 3.11: Compression statistics for Real Collection

File                        H0         H1         H2         H3         H4         H5         H6         H7         H8
Cere                        27.38%     22.63%     22.63%     22.50%     22.50%     22.50%     22.50%     22.38%     22.25%
                            (?)        (?)        (?)        (?)        (?)        (?)        (8697)     (?)        (?)
Para                        26.50%     23.50%     23.38%     23.38%     23.38%     23.38%     23.25%     23.25%     23.13%
                            (1)        (5)        (?)        (125)      (625)      (3125)     (14725)    (?)        (?)
Clostridium Botulium        23.25%     23.00%     22.88%     22.75%     22.75%     22.75%     22.63%     22.50%     22.25%
                            (1)        (4)        (16)       (?)        (256)      (1024)     (4096)     (16383)    (65118)
Escherichia Coli            25.00%     24.75%     24.50%     24.38%     24.25%     24.25%     24.13%     24.13%     23.88%
                            (1)        (15)       (145)      (779)      (2715)     (7436)     (15641)    (32561)    (85363)
Salmonella Enterica         25.00%     24.75%     24.50%     24.38%     24.25%     24.13%     24.13%     24.00%     23.75%
                            (?)        (9)        (?)        (97)       (?)        (?)        (4159)     (16457)    (65618)
Staphylococcus Aureus       23.88%     23.75%     23.75%     23.63%     23.63%     23.63%     23.50%     23.25%     22.75%
                            (1)        (5)        (18)       (67)       (260)      (1029)     (4102)     (16391)    (65282)
Streptococcus Pneumoniae    24.63%     24.38%     24.38%     24.25%     24.13%     24.13%     24.00%     23.75%     23.13%
                            (1)        (8)        (31)       (133)      (574)      (2183)     (?)        (21093)    (?)
Streptococcus Pyogenes      24.50%     24.38%     24.25%     24.13%     24.13%     24.13%     24.00%     23.88%     23.25%
                            (1)        (10)       (50)       (174)      (456)      (1291)     (4418)     (16758)    (65919)
Influenza                   24.63%     24.13%     24.13%     24.00%     23.88%     23.50%     22.00%     18.63%     13.25%
                            (1)        (15)       (125)      (583)      (2329)     (7978)     (21316)    (44748)    (101559)
coreutils                   68.38%     51.25%     35.88%     23.88%     17.00%     12.88%     10.13%     8.00%      6.50%
                            (1)        (236)      (18500)    (169716)   (606527)   (1335553)  (2258650)  (3258896)  (4247313)
kernel                      67.25%     50.50%     36.63%     25.75%     19.25%     15.13%     12.13%     9.63%      7.75%
                            (1)        (160)      (7122)     (90396)    (351918)   (773818)   (1305616)  (1912604)  (2553008)
einstein.en                 62.00%     46.38%     33.38%     21.13%     13.25%     9.00%      6.50%      4.75%      3.50%
                            (1)        (139)      (4546)     (28685)    (77333)    (142559)   (211506)   (276343)   (335151)
einstein.de                 63.00%     44.88%     32.63%     20.88%     13.25%     9.00%      6.13%      4.38%      3.13%
                            (1)        (117)      (3278)     (16765)    (39010)    (64884)    (89914)    (112043)   (130473)
nobel.en                    62.63%     44.63%     30.50%     18.25%     11.50%     8.13%      6.00%      4.50%      3.38%
                            (1)        (126)      (3566)     (18079)    (42334)    (69855)    (95644)    (119260)   (140401)
nobel.de                    61.13%     43.25%     31.13%     19.63%     12.50%     8.63%      6.00%      4.13%      3.00%
                            (1)        (118)      (2726)     (12959)    (30756)    (49695)    (66108)    (80467)    (92184)
turing.en                   63.25%     45.75%     32.00%     19.13%     11.50%     7.63%      5.38%      3.88%      2.88%
                            (1)        (103)      (2794)     (14091)    (33498)    (55489)    (75611)    (93402)    (108636)
turing.de                   62.38%     43.25%     29.25%     16.75%     9.50%      6.00%      3.88%      2.63%      2.00%
                            (1)        (100)      (1806)     (7268)     (15407)    (23070)    (29038)    (33714)    (37335)
world leaders               43.38%     24.38%     17.25%     11.63%     7.63%      5.13%      4.00%      3.50%      3.13%
                            (1)        (89)       (2526)     (23924)    (106573)   (246566)   (374668)   (468701)   (547040)

Table 3.12: Empirical entropy statistics for Real Collection
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3.5 Discussion 



It can be seen in the tables presented above that only p7zip and Re-Pair capture 
the repetitiveness of the texts, achieving a compression ratio at least one order of 
magnitude better than bzip2, gzip or ppmdi. It can also be noted in Tables 3.6 
and 3.7 that p7zip is more robust to capture the repetitiveness than Re-Pair, as 
with mutation ratios of 0.1% p7zip compresses 5 times better than Re-Pair. Table 
3.11 also shows that Re-Pair fails to capture some repetitions, as for all DNA texts 
except para and cere the compression of p7zip is two times better than that of Re- 
Pair. Tables 3.6 and 3.8 also show that the compression ratio of bzip2, gzip and 
ppmdi does not change significantly when increasing the repetitiveness of the text 
(decreasing mutation ratio). However, Tables 3.7 and 3.9 show that when decreasing 
the mutation ratio from 0.1% to 0.01% the gain in compression is greater than 10%, 
but when decreasing the mutation to 0.001% the compression ratio does not improve 
as much. It can also be seen that the compression ratios of bzip2 and gzip are close 
to the H2-H^, whereas, curiously, ppmdi compression ratios are not well predicted by 
any H_k. Notice that, since artificial texts are extremely compressible, small constant 
overheads (usually irrelevant) may produce significant differences in the size of the 
compressed file. 

Chapter 4 



LZ-End: A New Lempel-Ziv 
Parsing 

In this chapter we explain some properties of the LZ77 parsing (see Section 2.13) 
and present a variant that has the advantage of faster text extraction. The results 
presented in this chapter were published in the 20th Data Compression Conference 
(DCC) [KN10]. 

4.1 LZ77 on Repetitive Texts 

An interesting property of the LZ77 parsing is that it captures the repetitions of the 
text. Text repetitions, as well as single-character edits on a text, alter the number 
of phrases of the parsing very little. This explains why LZ77 is so strong on highly 
repetitive collections. 

Lemma 4.1. Given the texts T, T' and the characters a, b, the following statements 
hold: 

\[
\begin{aligned}
H^{LZ77}(TT)    &\le H^{LZ77}(T) + 1     && (4.1) \\
H^{LZ77}(TT\$)  &\le H^{LZ77}(T\$) + 1   && (4.2) \\
H^{LZ77}(TT')   &\le H^{LZ77}(TaT') + 1  && (4.3) \\
H^{LZ77}(TaT')  &\le H^{LZ77}(TT') + 1   && (4.4) \\
H^{LZ77}(TaT')  &\le H^{LZ77}(TbT') + 1  && (4.5)
\end{aligned}
\]

where H^{LZ77}(T) is the number of phrases of the LZ77 parsing of T. 

Proof. Assume the last phrase of the LZ77 parsing of T$ is $ and that H^{LZ77}(T$) = 
n'. That means the first n' − 1 phrases cover the text T. Now, if we have the text 
TT$, we have that the first n' − 1 phrases are the same as for the parsing of T and the 
last phrase would be T$, hence the inequality of Equation (4.2) holds. Now, assume 
the last phrase of the parsing is A$ for some A ≠ ε. Therefore the n'-th phrase 
of the parsing of TT would be AB for some B such that 1 ≤ |B| < |T|, thus this 
phrase does not completely cover TT. An additional phrase covers the remaining 
portion of the text, thus equality holds for Equation (4.2). The proof of Equation 
(4.1) is similar to the second part of Equation (4.2). 

Now consider Equation (4.3). Let Z[p] = XY be the last phrase covering T, where 
X is a suffix of T and Y is a prefix of T'. When adding the new character in the 
middle, in the worst case the phrase gets converted to Xa (this is the phrase that 
may increase the total number of phrases). Then the following phrase will cover at 
least the prefix Y, and each successive phrase will cover at least the next phrase of 
the original parsing. Hence, the number of phrases is at most one more than the 
original number of phrases. The proofs for Equations (4.4) and (4.5) are similar to 
the one above. □ 



The LZ78 parsing [ZL78] described in Section 2.14.3 is not that powerful. On 
T = a^n it produces n' = \sqrt{2n} + O(1) phrases, and this increases to n' = 2\sqrt{n} + O(1) 
on TT. LZ77, instead, produces n' = log_2(n) + O(1) phrases on T and just one more 
phrase on TT. 
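
The difference between the two parsings on T = a^n can be checked numerically. The following is a minimal sketch, not the implementation used in this thesis. It assumes, as the log_2(n) count implies, that an LZ77 source must lie entirely inside the already parsed prefix; the names `lz77_phrases` and `lz78_phrases` are illustrative, and both parsers are naive.

```python
import math

def lz77_phrases(text):
    """Greedy LZ77 parse; each source must lie entirely in the parsed prefix."""
    phrases, i, n = [], 0, len(text)
    while i < n:
        length = 0
        # Grow the copied part while it still occurs inside text[:i] and
        # one character is left over to act as the explicit trailing symbol.
        while i + length < n - 1 and text.find(text[i:i + length + 1], 0, i) != -1:
            length += 1
        phrases.append(text[i:i + length + 1])
        i += length + 1
    return phrases

def lz78_phrases(text):
    """LZ78 parse: each phrase is a previously seen phrase plus one character."""
    seen, phrases, current = set(), [], ""
    for ch in text:
        current += ch
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:  # leftover that repeats an earlier phrase
        phrases.append(current)
    return phrases

if __name__ == "__main__":
    n = 1024
    for label, text in (("T", "a" * n), ("TT", "a" * (2 * n))):
        print(label,
              "LZ77:", len(lz77_phrases(text)),
              "LZ78:", len(lz78_phrases(text)),
              "sqrt(2|text|):", round(math.sqrt(2 * len(text)), 1))
```

On these inputs the LZ77 count grows by one when the text is doubled, while the LZ78 count grows by a factor of about \sqrt{2}.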



4.2 LZ-End 

In this section we introduce a new LZ-like parsing. Its main characteristic is a faster 
random text extraction, while its compression is close to that of LZ77. 

Definition 4.2. The LZ-End parsing of text T[1,n] is a sequence Z[1,n'] of phrases 
such that T = Z[1]Z[2]...Z[n'], built as follows. Assume we have already processed 
T[1, i−1] producing the sequence Z[1, p−1]. Then, we find the longest prefix T[i, i'−1] 
of T[i,n] that is a suffix of Z[1]...Z[q] for some q < p, set Z[p] = T[i,i'] and continue 
with i = i' + 1. 

Example 4.3. Let T = 'alabar_a_la_alabarda$'; the LZ-End parsing is as follows: 

    a | l | ab | ar | _ | a_ | la | _a | labard | a$ 

In this example, when generating the seventh phrase we cannot copy two characters 
as in Example 2.31, because 'la' does not end in a previous end of phrase. However, 
'l' does end in an end of phrase, hence we generate the phrase 'la'. Notice that the 
number of phrases increased from 9 to 10 with respect to the original LZ77 scheme. 
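
Definition 4.2 can be turned directly into a naive parser. The sketch below is only meant to reproduce the example on small inputs: it tries every previous phrase end and every candidate length, so it runs in roughly cubic time. The function name `lz_end_parse` is illustrative and is not the construction algorithm of Section 4.4.

```python
def lz_end_parse(text):
    """Naive LZ-End parse following Definition 4.2 (roughly cubic time)."""
    phrases, ends = [], []  # ends[q] = number of characters covered by phrases 0..q
    i, n = 0, len(text)
    while i < n:
        best = 0
        for end in ends:
            # Longest prefix of text[i:] that is also a suffix of text[:end],
            # keeping one character free for the explicit trailing symbol.
            for length in range(min(end, n - 1 - i), best, -1):
                if text[end - length:end] == text[i:i + length]:
                    best = length
                    break
        phrases.append(text[i:i + best + 1])
        i += best + 1
        ends.append(i)
    return phrases

if __name__ == "__main__":
    print(" | ".join(lz_end_parse("alabar_a_la_alabarda$")))
```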

The LZ-End parsing is similar to the one proposed by Fiala and Green [FG89], 
in that theirs restricts where the sources start, while ours restricts where the sources 
end. This is the key feature that will allow us to extract arbitrary phrases in constant 
time per extracted symbol, as shown in Section 4.2.2. 

4.2.1 Encoding 

The output of an LZ77 compressor is, essentially, the sequence of triplets z(p) = 
(j, ℓ, c), such that the source of Z[p] = T[i, i'] is T[j, j + ℓ − 1], ℓ = i' − i, and c = T[i']. 
This format allows fast decompression of T, but not decompressing an individual 
phrase Z[p] in reasonable time (one must basically decompress the whole text). 

The LZ-End parsing, although it potentially generates more phrases than LZ77, 
permits a shorter encoding of each, of the form z(p) = (q, ℓ, c), such that the source of 
Z[p] = T[i, i'] is a suffix of Z[1]...Z[q], and the rest is as above. This representation 
is shorter because it stores the phrase identifier rather than a text position. We 
introduce a more sophisticated encoding that will, in addition, allow us to extract 
individual phrases in constant time per extracted symbol. 

• char[1,n'] (using n'⌈log σ⌉ bits) encodes the trailing characters (c above). 

• source[1,n'] (using n'⌈log n'⌉ bits) encodes the phrase identifier where the source 
ends (q above). 

• B[1,n] (using n' log(n/n') + O(n' + (n log log n)/log n) bits in compressed form [RRR02], see 
Section 2.5) marks the ending positions of the phrases in T. 

Thus we have z(p) = (q, ℓ, c) = (source[p], select_1(B, p + 1) − select_1(B, p) − 
1, char[p]). We can also know in constant time that the source of phrase p ends at 
select_1(B, source[p]) and that it starts ℓ positions before. Finally, we can know that 
the text position i belongs to phrase Z[rank_1(B, i − 1) + 1]. 
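
The role of bitmap B can be illustrated on the parsing of Example 4.3. The sketch below makes several assumptions: it uses plain Python lists with linear-time rank and select rather than the compressed representation of [RRR02], phrases are numbered from 1, and select_1(B, 0) = 0 by convention, so that the length of phrase p is select_1(B, p) − select_1(B, p − 1). All function names are illustrative.

```python
from itertools import accumulate

# LZ-End phrases of 'alabar_a_la_alabarda$' (Example 4.3).
PHRASES = ["a", "l", "ab", "ar", "_", "a_", "la", "_a", "labard", "a$"]

def build_bitmap(phrases):
    """B[1..n] with a 1 at the last position of every phrase (B[0] is unused)."""
    ends = list(accumulate(len(z) for z in phrases))
    bitmap = [0] * (ends[-1] + 1)
    for e in ends:
        bitmap[e] = 1
    return bitmap

def rank1(bitmap, i):
    """Number of 1s in B[1..i]."""
    return sum(bitmap[1:i + 1])

def select1(bitmap, j):
    """Position of the j-th 1 in B, with select1(B, 0) = 0 by convention."""
    if j == 0:
        return 0
    seen = 0
    for pos in range(1, len(bitmap)):
        seen += bitmap[pos]
        if seen == j:
            return pos
    raise ValueError("fewer than j ones in the bitmap")

def phrase_of_position(bitmap, i):
    """Identifier of the phrase that contains text position i."""
    return rank1(bitmap, i - 1) + 1

def phrase_length(bitmap, p):
    """Length of phrase p, including its trailing character."""
    return select1(bitmap, p) - select1(bitmap, p - 1)

B = build_bitmap(PHRASES)
char = [None] + [z[-1] for z in PHRASES]  # trailing characters, 1-based

if __name__ == "__main__":
    print("phrase 7 ends at position", select1(B, 7))
    print("position 10 belongs to phrase", phrase_of_position(B, 10))
    print("phrase 9 has length", phrase_length(B, 9), "and trailing char", char[9])
```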
Extract(start, len) 
 1  if len > 0 then 
 2      end ← start + len − 1 
 3      p ← rank_1(B, end) 
 4      if B[end] = 1 then 
 5          Extract(start, len − 1) 
 6          output char[p] 
 7      else 
 8          pos ← select_1(B, p) + 1 
 9          if start < pos then 
10              Extract(start, pos − start) 
11              len ← end − pos + 1 
12              start ← pos 
13          Extract(select_1(B, source[p + 1]) − select_1(B, p + 1) + start + 1, len) 

Figure 4.1: LZ-End extraction algorithm for T[start, start + len − 1]. 

4.2.2 Extraction Algorithm 

The algorithm to extract an arbitrary substring in LZ-End is given in Figure 4.1. The 
extraction works from right to left. First we compute the last phrase p intersecting 
the substring. If the last character is stored explicitly, i.e., it is an end of phrase (see 
line 4), we recursively extract the remaining substring (line 5) and then output char[p] 
(line 6). Otherwise we split the substring into two parts. The right one is the intersection of 
the rightmost phrase covering the substring and the substring itself, and is extracted 
recursively by going to the source of that phrase (line 13). The left part is also 
extracted recursively (line 10). 
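
The procedure of Figure 4.1 can be exercised on the running example. The sketch below is an illustration under stated assumptions, not the thesis implementation: the parse is built with a naive parser (`lz_end_triples`, a hypothetical helper), bitmap B is replaced by the sorted list `ends` of phrase-ending positions with a dummy entry `ends[0] = 0`, so that rank_1 becomes a binary search and select_1(B, p) is `ends[p]`, and positions are 1-based as in the figure.

```python
from bisect import bisect_right

def lz_end_triples(text):
    """Naive LZ-End parse. Returns 1-based arrays source, char and ends,
    where ends[p] plays the role of select_1(B, p) and ends[0] = 0."""
    source, char, ends = [0], [""], [0]
    i, n = 0, len(text)
    while i < n:
        best_len, best_q = 0, 0
        for q in range(1, len(ends)):
            end = ends[q]
            for length in range(min(end, n - 1 - i), best_len, -1):
                if text[end - length:end] == text[i:i + length]:
                    best_len, best_q = length, q
                    break
        source.append(best_q)             # phrase where the source ends (0 if empty)
        char.append(text[i + best_len])   # explicit trailing character
        i += best_len + 1
        ends.append(i)
    return source, char, ends

def extract(source, char, ends, start, length, out):
    """Append T[start, start + length - 1] to out, following Figure 4.1."""
    if length <= 0:
        return
    end = start + length - 1
    p = bisect_right(ends, end) - 1       # rank_1(B, end)
    if ends[p] == end:                    # B[end] = 1: explicit character
        extract(source, char, ends, start, length - 1, out)
        out.append(char[p])
    else:
        pos = ends[p] + 1                 # first position of phrase p + 1
        if start < pos:                   # left part ends at the end of phrase p
            extract(source, char, ends, start, pos - start, out)
            length = end - pos + 1
            start = pos
        # Right part: jump to the source of phrase p + 1.
        shifted = ends[source[p + 1]] - ends[p + 1] + start + 1
        extract(source, char, ends, shifted, length, out)

if __name__ == "__main__":
    text = "alabar_a_la_alabarda$"
    source, char, ends = lz_end_triples(text)
    out = []
    extract(source, char, ends, 15, 3, out)   # T[15, 17]
    print("".join(out))
```

The recursion depth grows with the length of the extracted substring, so an iterative formulation would be preferable for long extractions.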

While the algorithm works for extracting any substring, we can prove it takes 
constant time per extracted symbol when the substring ends at a phrase. 

Theorem 4.4. Function Extract outputs a text substring T[start, end] ending at a 
phrase in time O(end − start + 1). 

Proof. If T[start, end] ends at a phrase, then B[end] = 1. We proceed by induction on 
len = end − start + 1. The case len ≤ 1 is trivial by inspection. Otherwise, we output 
T[end] at line 6 after a recursive call on the same phrase and length len − 1. This time 
we go to line 8. The current phrase (now p + 1) starts at pos. If start < pos, we carry 
out a recursive call at line 10 to handle the segment T[start, pos − 1]. As this segment 
ends at the end of phrase p, induction shows that this takes time O(pos − start + 1). 
Now the segment T[max(start, pos), end] is contained in Z[p + 1] and it finishes one 
symbol before the phrase ends. Thus a copy of it finishes where Z[source[p + 1]] ends, 
so induction applies also to the recursive call at line 13, which extracts the remaining 
string from the source instead of from Z[p + 1], also in constant time per extracted 
symbol. □ 

We have shown that the algorithm extracts, in constant time per symbol, any substring 
that ends at the end of a phrase. Extracting an arbitrary substring may be more 
expensive than an end-of-phrase aligned one. 

Definition 4.5. Let T = Z[1]Z[2]...Z[n'] be an LZ-parsing of T[1,n]. Then the 
height of the parsing is defined as H = max_{1 ≤ i ≤ n} C[i], where C is defined as follows. 
Let Z[i] = T[a, b] be a phrase whose source is T[c, d], then 

\[
C[k] = C[(k - a) + c] + 1 \quad \forall\, a \le k < b, \qquad C[b] = 1.
\]

Array C counts how many times a character was transitively copied from its 
original source. This is also the extraction cost of that character. Hence, the value 
H is the worst-case bound for extracting a single character in the LZ parse. 

Lemma 4.6. In an LZ-End parsing it holds that H is at most the length of the longest 
phrase, i.e., H ≤ max_{1 ≤ p ≤ n'} |Z[p]|. 

Proof. We will prove by induction that ∀ 1 ≤ i < n, C[i] ≤ C[i + 1] + 1. From this 
inequality the lemma follows. For all positions i_p where a phrase p ends, it holds 
by definition that C[i_p] = 1. Thus, for all positions i in the phrase p, we have 

\[
C[i] \le C[i_p] + i_p - i \le |Z[p]|.
\]

The first phrase of any LZ-End parsing is T[1], and the second is either T[2] or 
T[2]T[3]. In the first case, we have C[1]C[2] = 1, 1, in the latter C[1]C[2]C[3] = 1, 2, 1. 
In both cases the property holds. Now, suppose the inequality is valid up to position i_p 
where the phrase Z[p] ends. Let i_{p+1} be the position where the phrase Z[p + 1] = T[a, b] 
ends (so a = i_p + 1 and b = i_{p+1}) and T[c, d] its source. For all i_p + 1 ≤ i < i_{p+1}, 
C[i] = C[(i − a) + c] + 1, and since d ≤ i_p, the inequality holds by inductive hypothesis 
for i_p + 1 ≤ i ≤ i_{p+1} − 2. By definition of the LZ-End parsing the source of a phrase 
ends in a previous end of phrase, hence C[i_{p+1} − 1] = C[d] + 1 = 2 ≤ 1 + 1 = C[i_{p+1}] + 1. 
For position i_{p+1} (end of phrase) the inequality trivially holds as it has by definition 
the least possible value. □ 

The above lemma does not hold for LZ77. Moreover, the LZ-End parsing yields 
a better extraction complexity. 

Lemma 4.7. Extracting a substring of length ℓ from an LZ-End parsing takes time 
O(ℓ + H). 

Proof. Theorem 4.4 already shows that the cost to extract a substring ending at a 
phrase boundary is constant per extracted symbol. The only piece of the code in 
Figure 4.1 that does not amortize in this sense is line 13, where we recursively unroll 
the last phrase, removing the last character each time, until hitting the end of the 
substring to extract. By definition of H, this line cannot be executed more than H 
times. So the total time is O(ℓ + H). □ 
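
Array C and the height H can be computed while parsing. The following sketch does so for the LZ-End parsing of the running example, under the assumption that ties between equally long sources are broken in favour of the earliest phrase end (the values of C may depend on this choice). The function name `copy_depths` is illustrative, and the parser is the naive one implied by Definition 4.2.

```python
def copy_depths(text):
    """Return (C, longest): the copy-depth array of Definition 4.5 for a naive
    LZ-End parse of text (0-based list), and the length of the longest phrase."""
    n = len(text)
    C, ends = [0] * n, []
    i, longest = 0, 0
    while i < n:
        best, src_end = 0, 0
        for end in ends:
            for length in range(min(end, n - 1 - i), best, -1):
                if text[end - length:end] == text[i:i + length]:
                    best, src_end = length, end
                    break
        c = src_end - best                 # 0-based start of the source
        for k in range(best):
            C[i + k] = C[c + k] + 1        # copied characters are one level deeper
        C[i + best] = 1                    # explicit trailing character
        longest = max(longest, best + 1)
        i += best + 1
        ends.append(i)
    return C, longest

if __name__ == "__main__":
    C, longest = copy_depths("alabar_a_la_alabarda$")
    print("C =", C)
    print("H =", max(C), " longest phrase =", longest)
```

On this example the height stays below the length of the longest phrase, as Lemma 4.6 requires, and consecutive entries of C satisfy C[i] ≤ C[i + 1] + 1.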

Remark 4.8. On a text coming from an ergodic Markov source of entropy h, the 
expected value of the longest phrase is However, as we are dealing with 

highly repetitive texts this expected length does not hold. 

Remark 4.9. Algorithm Extract (Figure 4.1) also works on the LZ77 parsing, but in 
this case the best theoretical bound we can prove for extracting a substring of length 
ℓ is O(ℓH). However, the results in Section 4.5.3 suggest that on average it may be 
much better. 



4.3 Compression Performance 

We now study the compression performance of LZ-End, first with respect to the 
empirical k-th order entropy and then on repetitive texts. 

4.3.1 Coarse Optimality 

We prove that LZ-End is coarsely optimal. The main tool is the following lemma. 

Lemma 4.10. All the phrases generated by an LZ-End parse are different. 

Proof. Assume by contradiction Z[p] = Z[p'] for some p < p'. When Z[p'] was 
generated, we could have taken Z[p] as the source, yielding phrase Z[p']c, longer than 
Z[p']. This is clearly a valid source as Z[p] is a suffix of Z[1]...Z[p]. So this is not 
an LZ-End parse. □ 

Lemma 4.11 ([LZ76]). Any parsing of T[1,n] into n' distinct phrases satisfies n' = 
O(n / log_σ n), where σ is the alphabet size of T. 

Lemma 4.12 ([KM99]). For any text T[1,n] parsed into n' different phrases, it holds 
n' log n' ≤ nH_k(T) + n' log(n/n') + Θ(n'(1 + k log σ)), for any k. 

Lemma 4.13. For any text T[1,n] parsed into n' different phrases, using LZ77 or 
LZ-End, it holds n' log n ≤ nH_k(T) + o(n log σ) for any k = o(log_σ n). 

Proof. Arroyuelo and Navarro [AN] prove that the property holds for any LZ parsing 
for which Lemmas 4.11 and 4.12 hold. In particular it holds for LZ77 and for our 
proposal, the LZ-End. □ 

Theorem 4.14. The LZ-End compression is coarsely optimal. 

Proof. The proof is based on the one by Kosaraju and Manzini [KM99] for LZ77. 
Here we consider in addition our particular encoding (the result holds for triplets 
(q, ℓ, c) as well). The size of the parsing in bits is 

\[
\begin{aligned}
\text{LZ-End}(T) &= n'\lceil \log\sigma \rceil + n'\lceil \log n' \rceil + n'\log\frac{n}{n'} + O\!\left(n' + \frac{n\log\log n}{\log n}\right) \\
&= n'\log n + O\!\left(n'\log\sigma + \frac{n\log\log n}{\log n}\right).
\end{aligned}
\]

Thus from Lemmas 4.10 and 4.12 we have 

\[
\text{LZ-End}(T) \le nH_k(T) + 2n'\log\frac{n}{n'} + O\!\left(n'(k+1)\log\sigma + \frac{n\log\log n}{\log n}\right).
\]

Now, by means of Lemma 4.11 and since n' log(n/n') is increasing in n', we get 

\[
\begin{aligned}
\text{LZ-End}(T) &\le nH_k(T) + O\!\left(\frac{n\log\sigma\,\log\log n}{\log n}\right) + O\!\left(\frac{n(k+1)\log^2\sigma}{\log n} + \frac{n\log\log n}{\log n}\right) \\
&= nH_k(T) + O\!\left(\frac{n\log\sigma\,(\log\log n + (k+1)\log\sigma)}{\log n}\right).
\end{aligned}
\]

Thus, dividing by n and taking k and σ as constants, we get that the compression ratio 
is 

\[
\rho(T) \le H_k(T) + O\!\left(\frac{\log\log n}{\log n}\right).
\]

□ 

4.3.2 Performance on Repetitive Texts 

We have not found a worst-case bound for the competitiveness of LZ-End compared to 
LZ77. However, we show, on the negative side, a sequence that produces almost twice 
the number of phrases when parsed with LZ-End, so LZ-End is at best 2-competitive 
with LZ77. On the positive side, we show that LZ-End satisfies some of the properties 
of Lemma 4.1. 

Example 4.15. Let T = 112 · 113 · 214 · 325 · 436 · 547 · ... · (σ−2)(σ−3)σ. The 
length of the text is n = 3(σ − 1). The LZ parsings are: 

LZ77:    1 | 12 | 113 | 214 | 325 | 436 | 547 | ... | (σ−2)(σ−3)σ 
LZ-End:  1 | 12 | 11 | 3 | 21 | 4 | 32 | 5 | 43 | 6 | 54 | 7 | ... 

The size of LZ77 is n' = σ and the size of LZ-End is n' = 2(σ − 1). 
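
The two phrase counts of this example can be reproduced for small alphabets. The sketch below assumes naive parsers (an LZ77 whose sources lie inside the parsed prefix, and an LZ-End that tries every previous phrase end); the helper names `example_text`, `lz77_count` and `lz_end_count` are illustrative. Each symbol k is encoded as a single character so that ordinary strings can be used.

```python
def example_text(sigma):
    """T = 112 . 113 . 214 . 325 ... (sigma-2)(sigma-3)sigma, one char per symbol."""
    blocks = [(1, 1, 2), (1, 1, 3)] + [(k, k - 1, k + 2) for k in range(2, sigma - 1)]
    return "".join(chr(ord("A") + s) for block in blocks for s in block)

def lz77_count(text):
    """Number of phrases of a naive LZ77 parse (sources inside the parsed prefix)."""
    count, i, n = 0, 0, len(text)
    while i < n:
        length = 0
        while i + length < n - 1 and text.find(text[i:i + length + 1], 0, i) != -1:
            length += 1
        i += length + 1
        count += 1
    return count

def lz_end_count(text):
    """Number of phrases of a naive LZ-End parse (Definition 4.2)."""
    ends, i, n = [], 0, len(text)
    while i < n:
        best = 0
        for end in ends:
            for length in range(min(end, n - 1 - i), best, -1):
                if text[end - length:end] == text[i:i + length]:
                    best = length
                    break
        i += best + 1
        ends.append(i)
    return len(ends)

if __name__ == "__main__":
    for sigma in range(4, 13):
        text = example_text(sigma)
        print(sigma, len(text), lz77_count(text), lz_end_count(text))
```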

For LZ-End we can only prove the following lemma regarding the concatenation 
of texts. 

Lemma 4.16. Given a text T, the following statements hold: 

\[
\begin{aligned}
H^{LZ\text{-}End}(TT)   &\le H^{LZ\text{-}End}(T) + 2    && (4.6) \\
H^{LZ\text{-}End}(TT\$) &\le H^{LZ\text{-}End}(T\$) + 1  && (4.7)
\end{aligned}
\]

where H^{LZ-End}(T) is the number of phrases of the LZ-End parsing of T. 

Proof. Assume H^{LZ-End}(T$) = n' and that the last phrase of the LZ-End parsing of 
T$ is $. That means the first n' − 1 phrases cover the text T. Now, if we have 
the text TT$, we have that the first n' − 1 phrases are the same as for the parsing 
of T and the last phrase would be T$ (since T ends at the end of the (n' − 1)-th 
phrase), hence the inequality of Equation (4.7) holds (actually it holds H^{LZ-End}(TT$) = 
H^{LZ-End}(T$)). Now, assume the last phrase of the parsing is A$ for some A ≠ ε, 
and that the prefix of T in the n' − 1 first phrases is xX, where x is a character. 
Therefore the n'-th phrase of the parsing of TT would be at least Ax. Then 
the (n' + 1)-th phrase will be XA$, thus equality holds for Equation (4.7). The 
situation is analogous if the n'-th phrase extends beyond Ax. For Equation (4.6) 
consider that T is parsed into n' − 1 phrases covering the prefix xX and the last 
phrase is aAb (where x, a and b are characters and A ≠ ε is a string). Then, the 
(n' + 1)-th phrase of TT is at least xXa and thus, the (n' + 2)-th phrase is Ab, 
because there must exist a phrase ending in A for phrase aAb to exist. If, instead, 
the n'-th phrase is just a, then the (n' + 1)-th phrase is xXa. □ 

 1  F ← {(−1, n + 1)} 
 2  i ← 1, p ← 1 
 3  while i ≤ n do 
 4      [sp, ep] ← [1, n] 
 5      j ← 0, ℓ ← j 
 6      while i + j ≤ n do 
 7          [sp, ep] ← BWS(sp, ep, T[i + j]) 
 8          mpos ← argmax_{sp ≤ k ≤ ep} A[k] 
 9          if A[mpos] ≤ n + 1 − i then 
10              break 
11          j ← j + 1 
12          (q, fpos) ← Successor(F, sp) 
13          if fpos ≤ ep then 
14              ℓ ← j 
15      Insert(F, (p, A^{-1}[n + 1 − (i + ℓ)])) 
16      output (q, ℓ, T[i + ℓ]) 
17      i ← i + ℓ + 1, p ← p + 1 

Figure 4.2: LZ-End construction algorithm. F stores pairs (phrase identifier, suffix 
array position) and answers successor queries on the text position. BWS(sp, ep, c) 
was defined in Section 2.12. 

4.4 Construction Algorithm 

We present an algorithm to compute the parsing LZ-End, inspired by the algorithm 
CSP2 by Chen et al. [CPS08]. We compute the range of all text prefixes ending 
with a pattern P, rather than suffixes starting with P [FG89]. 

We first build the suffix array (Section 2.11) A[1,n] of the reverse text, T^rev = 
T[n − 1]...T[2]T[1]$, so that T^rev[A[i], n] is the lexicographically i-th smallest suffix 
of T^rev. We also build its inverse permutation: A^{-1}[j] is the lexicographic rank of 
T^rev[j, n]. Finally, we build the Burrows-Wheeler Transform (BWT) (Section 2.12) 
of T^rev, T^bwt[i] = T^rev[A[i] − 1] (or T^rev[n] if A[i] = 1). 

On top of the BWT we will apply backward search (Section 2.12) to find out 
whether there are occurrences of T[i, i' − 1] (Definitions 2.30 and 4.2). 

Since, for LZ-End, the phrases must in addition finish at a previous phrase end, 
we maintain a dynamic set F where we add the ending positions of the successive 
phrases we create, mapped to A. That is, once we create phrase Z[p] = T[i, i'], we 
insert A^{-1}[n + 1 − i'] into F. 

Phrase ends:  1 2 4 6 7 9 
T = a|l|ab|ar|_|a_|la_alabarda$ 
F = {5, 18, 14, 20, 4, 2} 

i      =   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 
A      =  21 12  9 14 20  8 13 16  4  1 10 18  6 17  5  2 11 19  7 15  3 
A^{-1} =  10 16 21  9 15 13 19  6  3 11 17  2  7  4 20  8 14 12 18  5  1 

Figure 4.3: Example of LZ-End construction algorithm 

Backward search over T^rev adapts very well to our purpose. By considering the 
patterns P = (T[i, i' − 1])^rev for consecutive values of i', we are searching backwards 
for P in T^rev, and thus finding the ending positions of T[i, i' − 1] in T, by carrying 
out one further BWS step for each new i' value. Thus we can use F naturally. 

As we advance i' in T[i, i' − 1], we test whether A[sp, ep] contains some occurrence 
finishing before i in T, that is, starting after n + 1 − i in T^rev. If it does not, then 
we stop looking for larger i' values as there are no matches preceding T[i]. For this, 
we precompute a Range Maximum Query (RMQ) data structure [FH07] on A, which 
answers queries mpos = argmax_{sp ≤ k ≤ ep} A[k]. Then if A[mpos] is not large enough, 
we stop. 

In addition, we must know if i' finishes at some phrase end, i.e., if F contains 
some value in [sp, ep]. A successor query on F finds the smallest value fpos ≥ sp in 
F. If fpos ≤ ep, then it represents a suitable LZ-End source for T[i, i']. Otherwise, 
as the condition could hold again for a later [sp, ep] range, we do not stop but recall 
the last j = i' where it was valid. Once we stop because no matches ending before 
T[i] exist, we insert phrase Z[p] = T[i, j] and continue from i = j + 1. This may 
retraverse some text since we had processed up to i' ≥ j. We call N ≥ n the total 
number of text symbols processed. 

The algorithm is depicted in Figure 4.2. 

Example 4.17. Figure 4.3 shows the structures used during the parsing of the string 
'alabar_a_la_alabarda$'. The array A corresponds to the suffix array of the reversed 
text and A^{-1} to its inverse permutation. The figure shows the parsing up to 
the 6th phrase and the values inserted in the dictionary F. The values inserted in 
F are A^{-1}[len − i], where i is the ending position of a phrase and len the length of 
the text. For example, the second phrase ends in position 2, thus the value inserted 
corresponds to A^{-1}[21 − 2] = 18. 

Now we continue the process to generate the next phrase. First, using BWS 
we find the interval of A that represents the suffixes (of the reverse text) starting 
with 'l', obtaining the range [17, 18] (right gray zone). Then we look in F for the 
successor of 17, obtaining the value 18, which is still in the range. Hence, we have 
found a valid source. Afterward, we continue with the next character. Again, with 
BWS we find the interval of A representing the suffixes (of the reverse text) starting 
with 'al' (left gray zone), which are the prefixes of the text ending with 'la'. This 
gives us the range [11, 13]. We look for the successor of 11 in F, which is 14. Since 
this value is outside the interval, there are no valid sources. We continue this process 
until there are no more possible sources. Finally, we get that the only valid source is 
'l', generating the new phrase 'la'. 
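
The arrays of this example can be recomputed with a few lines of code. The sketch below is an illustration under stated assumptions: the suffix array is built by plain sorting instead of libdivsufsort, the suffix-array range is found by a linear scan rather than by backward search on the BWT, the set F is a sorted Python list queried with binary search, and the values inserted in F follow the worked example (A^{-1}[len − i]). All function names are hypothetical.

```python
from bisect import bisect_left

def reverse_text_arrays(text):
    """Suffix array A and its inverse (both 1-based) of the reversed text,
    assuming text ends with a unique terminator that stays in last position."""
    rev = text[:-1][::-1] + text[-1]
    order = sorted(range(len(rev)), key=lambda k: rev[k:])
    A = [0] + [k + 1 for k in order]      # A[0] is unused
    A_inv = [0] * len(A)
    for rank in range(1, len(A)):
        A_inv[A[rank]] = rank
    return rev, A, A_inv

def sa_range(rev, A, pattern):
    """1-based range [sp, ep] of suffixes of rev starting with pattern (naive scan)."""
    hits = [r for r in range(1, len(A)) if rev.startswith(pattern, A[r] - 1)]
    return (hits[0], hits[-1]) if hits else None

def ends_at_phrase_end(rev, A, F, piece):
    """True if some occurrence of piece in the text ends at a recorded phrase end."""
    rng = sa_range(rev, A, piece[::-1])
    if rng is None:
        return False
    sp, ep = rng
    k = bisect_left(F, sp)                # successor of sp in the sorted list F
    return k < len(F) and F[k] <= ep

if __name__ == "__main__":
    text = "alabar_a_la_alabarda$"
    rev, A, A_inv = reverse_text_arrays(text)
    phrase_ends = [1, 2, 4, 6, 7, 9]      # ends of the first six phrases
    F = sorted(A_inv[len(text) - e] for e in phrase_ends)
    print("A =", A[1:])
    print("F =", F)
    print("'l' usable as source: ", ends_at_phrase_end(rev, A, F, "l"))
    print("'la' usable as source:", ends_at_phrase_end(rev, A, F, "la"))
```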



In theory the construction algorithm can work within bit space (1) nH_k(T^rev) + 
o(n log σ) = nH_k(T) + o(n log σ) (since nH_k(T) = nH_k(T^rev) + O(log n) [FM05, Theorem 
A.3]) for building the BWT incrementally [GN08]; plus (2) 2n + o(n) bits for the 
RMQ structure [FH07]; plus (3) O(n' log n) bits for a successor data structure. After 
building the BWT incrementally in time O(n log n ⌈log σ / log log n⌉) [GN08], we can make it 
static, so that it supports access to the successive characters of T in time O(⌈log σ / log log n⌉), 
as well as to A and A^{-1} in time O(log^{1+ε} n) for any constant ε > 0 [FMMN07]. The 
RMQ structure is built in O(n) time and within the same final space, and answers 
queries in constant time. The successor data structure could be a simple balanced 
search tree, with leaves holding O(log n) elements, so that the access time is O(log n) 
and the space is n' log n (1 + o(1)) [Mun86]. Thus, using Lemmas 4.11 and 4.12, the 
overall construction space is 2n(H_k(T) + 1) + o(n log σ) bits, for any k = o(log_σ n). 
The time is dominated by the BWT construction, O(n log n ⌈log σ / log log n⌉), plus the N 
accesses to A, O(N log^{1+ε} n). If, instead, we use O(n log n) bits of space, we can 
build and store explicitly A and A^{-1} in O(n) time [KS03]. The overall time becomes 

^{-^ r log log n "I ■ 

Note that a simplification of our construction algorithm, disregarding F (and 
thus N = n), builds the LZ77 parsing using just n(H_k(T) + 2) + o(n log σ) bits and 
O(n log n (log^ε n + o(log σ))) time, which is less than the best existing solutions [OS08, 
CPS08]. 

In practice our implementation of the algorithm works within byte space (1) n 
as we maintain T explicitly; plus (2) 2.02n for our implementation of BWT (following 
Navarro's "large" FM-index implementation [Nav09], where L is maintained 
explicitly); plus (3) 4n for A, which is explicitly maintained; plus (4) 0.7n for Fischer's 
implementation of RMQ [FH07]; plus (5) n for A^{-1}, using a sampling-based 
implementation of inverse permutations [MRRR03] (Section 2.7); plus (6) 12n' for a 
balanced binary tree implementing the successor structure. This adds up to less than 
10n bytes in practice. A is built in time O(n log n) in practice; other construction 
times are O(n). After this, the time of the algorithm is O(N log n') = O(N log n). 

As we will see soon, N is usually (but not always) only slightly larger than n; we now 
prove it is limited by the phrase lengths. 

Lemma 4.18. The amount of text retraversed at any step is < |Z[p]| for some p. 

Proof. Say the last valid match T[i, j − 1] was with suffix Z[1]...Z[p − 1] for some 
p, thereafter we worked until T[i, i' − 1] without finding any other valid match, and 
then formed the phrase (with source p − 1). Then we will retraverse T[j + 1, i' − 1], 
which must be shorter than Z[p] since otherwise Z[1]...Z[p] would have been a valid 
match. □ 

Remark 4.19. On ergodic sources with entropy h, N = logn), but as explained 
this is not a realistic model on repetitive texts. 



4.5 Experimental Results 

We implemented two different LZ-End encoding schemes. The first is as explained in 
Section 4.2.1. In the second (LZ-End2) we store the starting position of the source, 
select_1(B, source[p]), rather than the identifier of the source, source[p]. This in theory 
raises the nH_k(T) term in the space to 2nH_k(T) (and noticeably in practice, as seen 
soon), yet we save one select operation at extraction time (line 13 in Figure 4.1), 
which has a significant impact on performance. In both implementations, bitmap B 
is represented by δ-encoding the consecutive phrase lengths (Section 2.5.2). Recall 
that, in a δ-encoded bitmap, select_1(B, p) and select_1(B, p + 1) cost O(1) after solving 
p ← rank_1(B, end), thus LZ-End2 does no select operations for extracting. 

We compare our compressors with LZ77 and LZ78 implemented by ourselves. 
LZ77 triples are encoded in the same way as LZ-End2. We include the best performing 
compressors of Chapter 3, p7zip and Re-Pair. Compared to p7zip, LZ77 differs in 
the final encoding of the triples, which p7zip does better. This is orthogonal to the 
parsing issue we focus on in this thesis. We also implemented LZB [Ban09], which 
limits the distance dist at which the phrases can be from their original (not transitive) 
sources, so one can decompress any window by starting from that distance behind; 
and LZ-Cost, a novel proposal where we limit the number of times any text character 
can be copied (i.e., its C[·] value in Definition 4.5), thus directly limiting the maximum 
cost per character extraction. We have found no efficient parsing algorithm for LZB 
and LZ-Cost, thus we test them on small texts only. We also implemented LZ-Begin, 
the "symmetric" variant of LZ-End, which also allows random phrase extraction in 
constant time per extracted symbol. LZ-Begin forces the source of a phrase to start 
where some previous phrase starts, just like Fiala and Green [FG89], yet phrases have 
a leading rather than a trailing character. Although the parsing is much simpler, the 
compression ratio is noticeably worse than that of LZ-End, as we will see in Section 
4.5.1. 

We used the texts of the Canterbury corpus (http://corpus.canterbury.ac.nz), 
the 50MB texts from the Pizza&Chili corpus (http://pizzachili.dcc.uchile.cl), 
and highly repetitive texts from the previous chapter. We use a 3.0 GHz Core 2 
Duo processor with 4GB of main memory, running Linux 2.6.24 and g++ (gcc version 
4.2.4) compiler with -O3 optimization. 

4.5.1 Compression Ratio 

Table 4.1 gives compression ratios for the different collections and parsers. Figure 
4.4 shows the same results graphically for one representative text of each collection. 
For LZ-End we omit the sampling for bitmap B, as it can be reconstructed on the 
fly at loading time. LZ-End is usually 5% worse than LZ77, and at most 10% over it 
on general texts and 20% on the highly repetitive collections, where the compression 
ratios are nevertheless excellent. LZ78 is from 20% better to 25% worse than LZ-End 
on typical texts, but it is orders of magnitude worse on highly repetitive collections. 
With parameter log(n)/2, LZ-Cost is usually close to LZ77, yet sometimes it is much 
worse, and it is never better than LZ-End except by negligible margins. LZB is not 
competitive at all. Finally, LZ-Begin is about 30% worse than LZ77 on typical texts, 
and up to 40 times worse for repetitive texts. This is because not all phrases of the 
parsing are unique (Lemma 4.20). This property was the key to prove the coarse 
optimality of the LZ parsings. 

Lemma 4.20. Not all the phrases generated by an LZ-Begin parsing are different. 

Proof. We prove this lemma by showing a counter-example. Let T = AxyAyAz, 
where x, y, z are distinct characters and A is a string. Suppose we have parsed up to 
Ax, then the next phrase will be yA, and the following phrase will also be yA. □ 
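
The counter-example can be reproduced with a small parser. Since LZ-Begin is only described informally here, the sketch below rests on assumptions: each phrase consists of a leading character followed by the longest copy that starts where some previous phrase starts, and the copied source is required to stay inside the already parsed prefix. The function name `lz_begin_parse` is illustrative.

```python
def lz_begin_parse(text):
    """Naive LZ-Begin sketch: a leading character, then a copy whose source
    starts at the beginning of some previous phrase."""
    phrases, starts = [], []
    i, n = 0, len(text)
    while i < n:
        best = 0
        for s in starts:
            length = 0
            # Copy from text[s:] while it matches what follows the leading
            # character and the source stays inside the parsed prefix text[:i].
            while (i + 1 + length < n and s + length < i
                   and text[s + length] == text[i + 1 + length]):
                length += 1
            best = max(best, length)
        phrases.append(text[i:i + 1 + best])
        starts.append(i)
        i += 1 + best
    return phrases

if __name__ == "__main__":
    # T = A x y A y A z with A = 'ab'
    print(lz_begin_parse("abxyabyabz"))
```

With A = 'ab' the phrase 'yab' is produced twice, which is the repetition used in the proof above.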

The above results show that LZ-End achieves very competitive compression ratios, 
even in the challenging case of highly repetitive sequences, where LZ77 excels. 

Canterbury     Size(KiB)  LZ77       LZ78       LZ-End     LZ-Cost    LZB        LZ-Begin   Re-Pair 
alice29.txt    148.52     47.17%     49.91%     49.32%     48.51%     61.75%     59.02%     72.29% 
asyoulik.txt   122.25     51.71%     52.95%     53.51%     52.41%     66.42%     62.34%     81.52% 
cp.html        24.03      43.61%     53.60%     45.53%     46.27%     66.26%     56.93%     78.65% 
fields.c       10.89      39.21%     54.73%     41.69%     44.44%     61.32%     60.61%     65.19% 
grammar.lsp    3.63       48.48%     57.85%     50.41%     56.30%     67.02%     67.14%     85.60% 
lcet10.txt     416.75     42.62%     46.83%     44.65%     43.44%     56.72%     54.21%     57.47% 
plrabn12.txt   470.57     50.21%     49.34%     52.06%     50.83%     63.55%     59.15%     74.32% 
xargs.1        4.13       57.87%     65.38%     59.56%     59.45%     86.37%     73.14%     107.33% 
aaa.txt        97.66      0.055%     0.51%      0.045%     1.56%      0.95%      0.040%     0.045% 
alphabet.txt   97.66      0.110%     4.31%      0.105%     0.23%      1.15%      0.100%     0.081% 
random.txt     97.66      107.39%    90.10%     105.43%    107.40%    121.11%    106.9%     219.24% 
E.coli         4529.97    31. IS^^   27.70%     31.72%     -          -          35.99%     57.63% 
bible.txt      3952.53    34.18%     3().27%    3(>. 11%:  -          -          i3.98'A.   11.81% 
world192.txt   2415.43    29.04%     38.52%     30.99%     -          -          41.52%     38.29% 
pi.txt         976.56     55.73%     47.13%     55.99%     -          -          57.36%     108.08% 

Pizza&Chili    Size(MiB)  LZ77       LZ78       LZ-End     LZ-Begin   Re-Pair 
Sources        50         28.50%     41.14%     31.00%     41.95%     31.07% 
Pitches        50         44.50%     59.30%     45.78%     57.22%     59.90% 
Proteins       50         47.80%     53.20%     47.84%     54.95%     71.29% 
DNA            50         31.88%     28.12%     32.76%     34.28%     45.90% 
English        50         31.12%     41.80%     31.12%     38.54%     30.50% 
XML            50         17.00%     21.24%     17.64%     25.49%     18.50% 

Repetitive          Size(MiB)  LZ77         LZ78       LZ-End       LZ-Begin     Re-Pair 
Wikipedia Einstein  357.40     9.97x10-2%   9.29%      1.01x10-1%   4.27%        1.04x10-1% 
World Leaders       40.65      1.73%        15.89%     1.93%        7.97%        1.89% 
Rich String 11      48.80      3.20x10-*%   0.82%      4.18x10-*%   0.01%        3.75x10-*% 
Fibonacci 42        255.50     7.32x10-"%   0.40%      5.32x10-"%   6.07x10-"%   2.13x10-"% 
Para                409.38     2.09%        25.49%     2.48%        7.29%        2.74% 
Cere                439.92     1.48%        25.33%     1.74%        6.15%        1.86% 
Coreutils           195.77     3.18%        27.57%     3.35%        7.33%        2.54% 
Kernel              246.01     1.35%        30.02%     1.43%        3.43%        1.10% 

Table 4.1: Compression ratio of different parsings, in percentage of compressed over 
original size. We use parameter cost — (logn)/2 for LZ-Cost and dist — n/bior LZB. 
[Plot "Compression of different Compressors": series LZ78, LZ-End, Re-Pair; files plrabn12.txt, DNA, Para]




Figure 4.4: Compression ratio for different compressors 

Consistently with Chapter 3, Re-Pair results show that grammar-based compres- 
sion is a relevant alternative. Yet, we note that it is only competitive on highly 
repetitive sequences, where most of the compressed data is in the dictionary. This 
implementation applies sophisticated compression to the dictionary, which we do not 
apply on our compressors. Those sophisticated dictionary compression techniques 
prevent direct access to the grammar rules, essential for extracting substrings. 



4.5.2 Parsing Time 

Figure 4.5 shows parsing times on two files for LZ77 (implemented following CPS2
[CPS08]), LZ-End with the algorithm of Section 4.4, and p7zip. We show separately
the time of the suffix array construction algorithm we use, libdivsufsort
(http://code.google.com/p/libdivsufsort), common to LZ77 and LZ-End.

Our LZ77 construction time is competitive with the state of the art (p7zip), thus
the excess of LZ-End is due to the more complex parsing. Least squares fitting
for the nanoseconds/char yields 10.4 log n + O(1/n) (LZ77) and 82.3 log n + O(1/n)
(LZ-End) for the Einstein text, and 32.6 log n + O(1/n) (LZ77) and 127.9 log n + O(1/n)
(LZ-End) for XML. The correlation coefficient is always over 0.999, which suggests
that N = O(n) and our parsing takes O(n log n) time in practice. Indeed, across
all of our collections, the ratio N/n stays between 1.05 and 1.37, except on aaa.txt
and alphabet.txt, where it is 10-14 (which suggests that N = ω(n) in the worst
case). Figure 4.6 shows the total text traversed by the LZ-End parsing algorithm for two
different texts.

[Plots "PizzaChili XML File (size=282MiB)" and "Wikipedia Einstein (size=357MiB)"; x-axis: log(n), from 19 to 28]

Figure 4.5: Parsing times for XML and Wikipedia Einstein, in microseconds per
character.

[Plot "Total text traversal of LZ-End"; x-axis: log(size (MB)), from 0 to 8]

Figure 4.6: Total text traversed during LZ-End construction algorithm.

The LZ-End parsing time breaks down as follows. For XML: BWS 36%, RMQ
19%, tree operations 33%, SA construction 6% and inverse SA lookups 6%. For
Einstein: BWS 56%, RMQ 19%, tree operations 17%, and SA construction 8% (the
inverse SA lookups take negligible time).



4.5.3 Text Extraction Speed 

Figure 4.7 shows the extraction speed of arbitrary substrings of increasing length. The
three parsings (LZ77, LZ-End and LZ-End2) are parameterized to use approximately
the same space, 550KiB for Wikipedia Einstein and 64MiB for XML. This is achieved
by adjusting the sample step s of the δ-encoded bitmaps (Section 2.5.2). It can be
seen that (1) the time per character stabilizes after some extracted length, as expected
from Lemma 4.7, (2) LZ-End variants extract faster than LZ77, especially on very
repetitive collections, and (3) LZ-End2 is faster than LZ-End, even if the latter invests
its better compression in a denser sampling. Least squares fits for the extraction
time of a substring of length m are given in Table 4.2.



      Pizza&Chili XML              Wikipedia Einstein

Scheme    Model                Scheme    Model
LZ77      4.44 + 0.33m         LZ77      19.09 + 0.38m
LZ-End    7.40 + 0.36m         LZ-End    5.75 + 0.19m
LZ-End2   6.41 + 0.27m         LZ-End2   5.64 + 0.19m

Table 4.2: Least squares fitting for extraction time. All correlation coefficients are
over 0.999.



[Plots "PizzaChili XML File (size=282MiB)" and "Wikipedia Einstein (size=357MiB)"; x-axis: log(extraction length), from 2 to 18]

Figure 4.7: Extraction speed vs extracted length, for XML and Wikipedia Einstein. 

We now set the extraction length to 1,000 and measure the extraction speed per
character, as a function of the space used by the data and the sampling. Here we use
bitmap B and its sampling for the other formats as well. LZB and LZ-Cost also have
their own space/time trade-off parameter; we tried several combinations and chose
the points dominating the others. Figure 4.8 shows the results for small and large
files.

It can be seen that LZB is not competitive, whereas LZ-Cost follows LZ77 closely
(while offering a worst-case guarantee). The LZ-End variants dominate the trade-off
except when LZ77/LZ-Cost are able to use less space. On repetitive collections,
LZ-End2 is more than 2.5 times faster than LZ77 at extraction.
[Plots "Canterbury plrabn12.txt (size=471KiB)", "Fibonacci Sequence (size=502KiB)", "PizzaChili XML File (size=282MiB)" and "Wikipedia Einstein (size=357MiB)"; x-axis: Compression Ratio]

Figure 4.8: Extraction speed vs parsing and sampling size, on different texts.

Chapter 5

An LZ77-Based Self-Index 



In this chapter we describe a self-index based on the LZ77 parsing. It builds on the
ideas of the original LZ-based index proposed by Kärkkäinen and Ukkonen [KU96a,
Kar99] and the ideas presented by Navarro for reducing its space usage [Nav08]. Our
index will be mostly independent of the type of Lempel-Ziv parsing used, and we
will combine it with LZ77 and LZ-End. We use compact data structures to achieve
the minimum possible space. These structures also allow one to convert the original
index into a self-index, so that we do not need the text anymore.

As we will show, the index includes all the structures needed to randomly extract
any substring from the text, introduced in the previous chapter. The worst-case
time to extract a substring of length ℓ is O(ℓH) for LZ77 and O(ℓ + H) for LZ-End
(see Section 4.2.2). Additionally, the proposed index only supports count queries by
performing a full locate, and exists queries by essentially locating one occurrence. For
these reasons, in the following we focus only on locate queries.



5.1 Basic Definitions 



Assume we have a text T of length n, which is partitioned into n' phrases using
an LZ77-like compressor (see Chapter 4). Let P[1, m] be a search pattern. We will
call primary occurrences of P those covering more than one phrase; special primary
occurrences those completely covered by a phrase and ending at the end of that
phrase; and secondary occurrences those completely covered by a phrase and not
ending at the end of the phrase.
Example 5.1.

  position   1 | 2 | 3 4 | 5 6 | 7 | 8 9 | 10 11 12 | 13 14 15 16 17 18 19 | 20 21
  phrase     a | l | ab  | ar  | _ | a_  | la_      | alabard              | a$


In this example the occurrence of 'lab' starting at position 2 is primary
as it spans two phrases. The second occurrence, starting at position 14,
is secondary. The occurrence of 'rd' starting at position 18 is special
primary.

We need to distinguish between these three types of occurrences, as we will find 
first the primary occurrences (including the special ones), which will be then used 
to recursively find the secondary ones (which, in turn, will be used to find further 
secondary occurrences). 
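
The three types can be told apart from the phrase boundaries alone. The following is a minimal sketch, not the index's actual code: `classify` and `phrase_ends` are hypothetical names, positions are 1-based as in Example 5.1, and the phrase list is the parsing of the running example.

```python
from bisect import bisect_left

def classify(start, m, phrase_ends):
    """Classify the occurrence T[start, start+m-1] (1-based positions).

    phrase_ends is the sorted list of the last text position of every phrase.
    """
    end = start + m - 1
    # Index of the phrase that contains each endpoint of the occurrence.
    first_phrase = bisect_left(phrase_ends, start)
    last_phrase = bisect_left(phrase_ends, end)
    if first_phrase != last_phrase:
        return "primary"            # spans more than one phrase
    if phrase_ends[last_phrase] == end:
        return "special primary"    # inside one phrase, ending at its end
    return "secondary"              # strictly inside one phrase

phrases = ["a", "l", "ab", "ar", "_", "a_", "la_", "alabard", "a$"]
phrase_ends, pos = [], 0
for p in phrases:
    pos += len(p)
    phrase_ends.append(pos)

print(classify(2, 3, phrase_ends))    # 'lab' at position 2
print(classify(14, 3, phrase_ends))   # 'lab' at position 14
print(classify(18, 2, phrase_ends))   # 'rd' at position 18
```

The three calls reproduce the three cases of Example 5.1.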

5.2 Primary Occurrences 

By definition, a primary occurrence covers at least two phrases. Thus, each primary
occurrence can be seen as P = LR, where the left side L is a suffix of a phrase and
the right side R is the concatenation of zero or more consecutive phrases plus a prefix
of the next phrase. For this reason, to find this type of occurrence we partition the
pattern in two (in every possible way). Then, we search for the occurrences of the
left part of the pattern in the suffixes of the phrases and for the right part in the
prefixes of the suffixes of the text starting at the beginning of phrases. Then, we need
to find which pairs of left and right occurrences actually represent an occurrence of
pattern P:

1. Partition the pattern P[1, m] into P[1, i] and P[i + 1, m] for each 1 ≤ i < m.

2. Search for the right part P[i + 1, m] in the prefixes of the suffixes of the text
starting at phrases.

3. Search for the left part P[1, i] in suffixes of phrases.

4. Connect both results, generating all primary occurrences.
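
The four steps above can be sketched by brute force, without any of the tries or range structures introduced in the following sections. This is only an illustration under stated assumptions: `primary_occurrences` is a hypothetical helper, it scans all phrases instead of searching an index, and it works on the uncompressed text of the running example.

```python
def primary_occurrences(phrases, pattern):
    """Return the 1-based start positions of the primary occurrences of pattern."""
    text = "".join(phrases)
    starts, pos = [], 0               # 0-based offset where each phrase begins
    for p in phrases:
        starts.append(pos)
        pos += len(p)
    found = set()
    for i in range(1, len(pattern)):  # step 1: P[1, i] and P[i+1, m]
        left, right = pattern[:i], pattern[i:]
        for k in range(len(phrases) - 1):
            # step 3: the left part must be a suffix of phrase k
            # step 2: the right part must be a prefix of the text suffix
            #         that starts at phrase k + 1
            if phrases[k].endswith(left) and text.startswith(right, starts[k + 1]):
                found.add(starts[k + 1] - i + 1)   # step 4: connect both sides
    return sorted(found)

phrases = ["a", "l", "ab", "ar", "_", "a_", "la_", "alabard", "a$"]
print(primary_occurrences(phrases, "ala"))   # [1]
print(primary_occurrences(phrases, "ba"))    # [4]
```

Because the left part must fit entirely inside one phrase, each primary occurrence is reported by exactly one partition: the one that cuts at the first phrase boundary inside the occurrence.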

5.2.1 Right Part of the Pattern 

To find the right side P[i + 1, m] of the pattern we use a suffix trie (recall Sections
2.9 and 2.14.2) that indexes all suffixes of T starting at the beginning of a phrase. In
the leaves of the tree we store the identifier (id) of the phrases. Conceptually, these
form an array id that stores the phrase ids in lexicographic order (i.e., the leaves of
the suffix trie). As we see later, we do not need to store id explicitly.



[Figure: suffix trie over the suffixes starting at phrases 1 to 9: a | l | ab | ar | _ | a_ | la_ | alabard | a$]

Figure 5.1: The suffix trie for the string 'alabar_a_la_alabarda$'. The dark node
is the node at which we stop searching for the pattern 'la', and the gray leaves
represent the phrases that start with that pattern.

We will represent the suffix trie as a labeled tree using DFUDS (Section 2.8). To
search for a pattern we descend through the tree using labeled_child (recall Section
2.8), and then discard as many characters of the pattern as the skip of the branch
indicates. We continue this process until we reach a leaf, the pattern is completely
consumed, or we cannot descend anymore. Our answer will be an interval of
the array of ids, representing all phrases starting with the pattern P[i + 1, m]. In
case we consume the pattern in an internal node, we need to go to the leftmost and
rightmost leaves in order to obtain the interval, which is computed using leaf_rank
and represents the start and end positions in the array of the ids.

Example 5.2. Suppose we are looking for the right pattern 'la'. Figure 5.1 shows
in dark the node at which we stop searching for the pattern, and in gray the phrases
that start with that pattern. The answer is the range [8,9] (i.e., the lexicographical
order of the phrases).




Remark 5.3. Recall from Section 2.9 that in a PATRICIA tree, after searching for
the positions we need to check if they are actually a match, as some characters are
not checked because of the skips. In the example presented above, the answer would
have been the same if we were searching for any right pattern of the form lx, where
x is a character distinct from a. We use a different method here, which is explained
in Section 5.2.3.

We do not explicitly store the skips in our theoretical proposal, as they can be
computed from the tree and the text. Given a node in the trie, if we go to the leftmost
and rightmost leaves, we can extract the corresponding suffixes until computing how
many characters they share. This value will be the sum of all the skips from the
root to the given node. However, we already know they share S characters, where
S is the sum of all skips from the root to the previous node (i.e., the parent node).
Therefore, to compute the skip, we extract the suffixes of both leaves skipping the
first S characters. The number of symbols shared by both extracted strings will be
the skip. Extracting a skip of length s will take at most O(sH) time both for LZ77
and for LZ-End, since the extraction is from left to right and we have to extract one
character at a time until they differ. Thus, the total time for extracting the skips as
we descend is O(mH).
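
The skip computation can be illustrated on the plain text. The sketch below is an assumption-laden simplification: `skip_length` is a hypothetical helper that reads characters directly from an uncompressed string, whereas the index would extract them one at a time from the parsing.

```python
def skip_length(text, leftmost, rightmost, shared):
    """Symbols shared by two suffixes beyond their first `shared` symbols.

    leftmost and rightmost are the 0-based text offsets of the suffixes stored
    in the leftmost and rightmost leaves below the node; `shared` is S, the sum
    of the skips from the root to the parent node.
    """
    i, j = leftmost + shared, rightmost + shared
    skip = 0
    while i < len(text) and j < len(text) and text[i] == text[j]:
        skip += 1
        i += 1
        j += 1
    return skip

text = "alabar_a_la_alabarda$"
# The suffixes starting at phrases 'la_' (offset 9) and 'l' (offset 1)
# are the two leaves that start with 'l'.
print(skip_length(text, 9, 1, 0))
```

The loop stops at the first differing character, so the work is proportional to the skip itself, which is where the per-skip extraction cost in the text comes from.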

5.2.2 Left Part of the Pattern 

To find the left part P[1, i] of the pattern we have a trie (actually a PATRICIA trie,
Section 2.9) that indexes all the reversed phrases, stored as a compact labeled tree
(Section 2.8). Thus to find the left part of the pattern in the text we need to search
for (P[1, i])^rev in this trie. The array that stores the leaves of the trie is called rev_id
and is stored explicitly.

The search process and the considerations for this tree are exactly the same as
the ones for Section 5.2.1. The only difference with the suffix trie is that the
computation of the skips is faster now. Our text extraction algorithm works from
right to left and since the text is reversed our algorithm outputs the characters in
the correct order. Thus, extracting a skip of length s takes O(sH) time for LZ77 and
O(s + H) time for LZ-End. However, in the worst case the total time would still be
O(mH) as all skips may be of length 1.

Example 5.4. Suppose we are looking for the left pattern 'a'. Figure 5.2 shows in 
gray the node at which we stop searching for the pattern. In this case we end up in 
a leaf, so that is the only phrase that ends with the given pattern. The answer is the 
range [4,4]. 
Figure 5.2: The reverse trie for the string 'alabar_a_la_alabarda$'. The gray leaf
is the node at which we stop searching for the pattern 'a'.

5.2.3 Connecting Both Parts 

In the previous steps we found two intervals, one in the id array and the other in
the rev_id array. These intervals represent the sets of phrases where the matches of
the right side of the pattern start (id array interval) and the phrases ending with
the left side of the pattern (rev_id array interval). Actual occurrences of the pattern
are composed of consecutive phrases. Hence, to find the occurrences of the pattern,
we need to find which ids in the right interval are consecutive to those rev_ids in
the left interval. For doing so we use a range structure (see Section 2.6.1) that
connects the consecutive phrases in both trees. Figure 5.3 shows the range data
structure connecting both trees for our example string and below the sequence that
is represented with the wavelet tree.

This structure is built from a permutation π on [1, . . . , n']. This permutation is
just an array containing for each id (column) the corresponding rev_id (row). In
other words, the permutation holds that id[i] = 1 + rev_id[π(i)]. For our example the
permutation array would be {8,6,1,7,0,3,5,2,4} (note that we count from left to
right and from bottom to top, and that we assume that rev_id[0] = 0).

Example 5.5. Suppose we are looking for the pattern 'ala'. The possible partitions
are (a, la) and (al, a). Figure 5.3 shows in gray the ranges obtained when searching
for the left and right part of partition (a, la). Then we look for all points inside those
ranges, obtaining the only primary occurrence that starts at phrase 1. The same
procedure is carried out for the other partition.
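
The arrays of this section can be rebuilt for the running example with plain sorting. The sketch below is illustrative only: all names are hypothetical, the tries are replaced by scans over sorted keys, the range structure by a loop over the column interval, and characters are ordered by code point ('$' < '_' < letters).

```python
phrases = ["a", "l", "ab", "ar", "_", "a_", "la_", "alabard", "a$"]
text = "".join(phrases)
n_phr = len(phrases)

first = [0] * (n_phr + 1)     # first[k]: 0-based text offset of phrase k (k is 1-based)
offset = 0
for k, p in enumerate(phrases, 1):
    first[k] = offset
    offset += len(p)

# id: phrases sorted by the text suffix that starts at them (slot 0 unused).
ids = [0] + sorted(range(1, n_phr + 1), key=lambda k: text[first[k]:])
# rev_id: row 0 is a dummy phrase 0, then phrases 1..n'-1 sorted by reversed phrase.
rev_ids = [0] + sorted(range(1, n_phr), key=lambda k: phrases[k - 1][::-1])
row_of = {phrase: row for row, phrase in enumerate(rev_ids)}
# pi[i-1] is the row of the phrase that precedes phrase id[i] in the text.
pi = [row_of[ids[i] - 1] for i in range(1, n_phr + 1)]

suffix_keys = [""] + [text[first[k]:] for k in ids[1:]]
reverse_keys = [""] + [phrases[k - 1][::-1] for k in rev_ids[1:]]

def interval(keys, prefix):
    """Inclusive range of positions whose key starts with the non-empty prefix."""
    hits = [r for r, key in enumerate(keys) if key.startswith(prefix)]
    return (hits[0], hits[-1]) if hits else None

def primary(pattern):
    occs = []
    for i in range(1, len(pattern)):
        cols = interval(suffix_keys, pattern[i:])          # range in id
        rows = interval(reverse_keys, pattern[:i][::-1])   # range in rev_id
        if cols is None or rows is None:
            continue
        for c in range(cols[0], cols[1] + 1):   # stand-in for the range query
            if rows[0] <= pi[c - 1] <= rows[1]:
                occs.append(first[ids[c]] - i + 1)   # 1-based text position
    return sorted(occs)

print(pi)               # [8, 6, 1, 7, 0, 3, 5, 2, 4]
print(primary("ala"))   # [1]
```

The permutation printed here coincides with the one given in the text, and the search for 'ala' finds its single point for partition (a, la), as in Example 5.5.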
Figure 5.3: The range structure for the string 'alabar_a_la_alabarda$'. The gray
circle marks the only primary occurrence of the pattern 'ala', and the gray nodes
show the ranges defined by the left and right part of the pattern.



Remark 5.6. The range structure allows us to compute id[i], just storing the rev_id
array. Say we want to compute id[i]. We extract the value S[i] from the wavelet
tree, giving us the row p where the corresponding reverse id is. Then we compute
id[i] = 1 + rev_id[p].

Example 5.7. Say we want to compute id[6] (i.e., the phrase id of the 6th
lexicographically smallest phrase). We extract from the wavelet tree the 6th symbol, getting
the value 3. This value is the lexicographical order of the reversed phrase that
precedes it in the text. Computing rev_id[3] = 7, we know that the preceding phrase is
phrase number 7. Hence, the 6th phrase is phrase number 8 (i.e., id[6] = 8).

At this stage we also have to validate that the answers returned by the search
query are actual occurrences, as the PATRICIA tries by themselves do not guarantee
the pattern found is actually a match (see Remark 5.3). For the first occurrence
reported by the range data structure we extract the substring of length m starting at
the reported position and check if it matches the pattern. If so we can ensure that all
the other reported occurrences match the pattern as well, otherwise no occurrence is
a match. This process works because all occurrences reported by both tries share all
characters, thus all occurrences reported by the range query share all characters. We
check the validity of the occurrences here as the range check is cheaper than extracting
text and we want to extract text only when a candidate for a complete occurrence is
found.

This structure adds O(log n') time to the search phase, and O(log n') time per
primary occurrence found.

Note that we are able to answer exists queries with the structures explained so
far. If the range search reports at least one occurrence, then we check if one of
those occurrences is an actual match. If there is a match, then the
pattern is present in the text.

5.2.4 Special Primary Occurrences 

The special primary occurrences could be found using the same steps explained above
for primary occurrences, taking the left part of the pattern as the pattern itself and
the right side of the pattern as the empty string ε. However, we do know that looking
for ε in the suffix trie will return the complete tree, thus making the search in the
range structure unnecessary. For this reason we call this type of occurrence special
primary, as we search for them slightly differently from the primary ones. For these
occurrences we just need to search for P^rev in the reverse trie.

Since the search for P^rev in the reverse trie gives us a range in the rev_id array, we
decided to store it explicitly instead of the id array. Furthermore, the result of the
range search gives us positions in the rev_id array.

5.2.5 Converting Phrase Ids to Text Positions 

From the range structure we obtain the phrase id where an occurrence lies. Then we
need to convert it to a real text position. For doing so, we use a bitmap that marks
the ends of phrases. This bitmap is the same B used in Chapter 4 for extracting
text. Figure 5.4 shows the bitmap for the example string. The bitmap is below the
parsing.

Figure 5.4: The bitmap B of phrases for the string 'alabar_a_la_alabarda$'

The conversion between phrase ids and text positions takes constant time as follows:

• phrase(pos) = 1 + rank_1(B, pos - 1): it gives the phrase id containing any text
position pos.

• first_pos(id) = select_1(B, id - 1) + 1: position of the first character of phrase
id.

• last_pos(id) = select_1(B, id): position of the trailing character of phrase id.

Recall from Section 4.2.1 that this bitmap also allows us to compute the length
of the phrase as length(id) = select_1(B, id + 1) - select_1(B, id).
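
A small sketch of these conversions on the bitmap of the running example follows. It is only meant to make the formulas concrete: `rank1` and `select1` are naive linear-time stand-ins for the constant-time operations of the compressed bitmap, and `phrase_length` is derived here from first_pos and last_pos, which is an assumption of this sketch rather than the formula of Section 4.2.1.

```python
B = "110101101001000000101"   # a 1 marks the last position of each phrase

def rank1(bits, pos):
    """Number of 1s among the first pos positions (positions are 1-based)."""
    return bits[:pos].count("1")

def select1(bits, j):
    """1-based position of the j-th 1; select1(bits, 0) is 0 by convention."""
    if j == 0:
        return 0
    seen = 0
    for pos, bit in enumerate(bits, 1):
        seen += bit == "1"
        if seen == j:
            return pos
    raise ValueError("the bitmap has fewer than j ones")

def phrase(pos):
    return 1 + rank1(B, pos - 1)

def first_pos(phrase_id):
    return select1(B, phrase_id - 1) + 1

def last_pos(phrase_id):
    return select1(B, phrase_id)

def phrase_length(phrase_id):
    return last_pos(phrase_id) - first_pos(phrase_id) + 1

print(phrase(14), first_pos(8), last_pos(8), phrase_length(8))   # 8 13 19 7
```

Position 14 falls inside the phrase 'alabard', which is phrase 8 and occupies positions 13 to 19.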

5.2.6 Implementation Considerations 

Here we explain some considerations we made when implementing our index. 

• Skips: as the average value for the skips is usually very low and computing them
from the text phrases is slow in practice, we considered storing the skips, for
one or for both tries, using Directly Addressable Codes (Section 2.4.1). Note
that in this case we never access array id nor rev_id during the trie traversal;
they are only accessed when checking and reporting the occurrences.

• Binary Search: instead of storing the trie we can do a binary search over the
ids (rev_ids) of the suffix trie (reverse trie). For the suffix trie, we do not
have explicitly the array of ids, but as shown in Remark 5.6 we can retrieve
them using the range structure and the rev_id array. This alternative modifies
the complexity of searching for a prefix/suffix of P to O(mH log n') for LZ77 or
O((m + H) log n') for LZ-End (actually, since we extract the phrases right-to-left,
binary search on the reverse trie costs O(m log n') for LZ-End). Additionally,
we could store explicitly the array of ids, instead of accessing them through the
rev_ids. Although this alternative increases the space usage of the index and
does not improve the complexity, it gives an interesting trade-off in practice.
5.3 Secondary Occurrences

Secondary occurrences are found from the primary occurrences and, recursively, from 
other previously discovered secondary occurrences. 



5.3.1 Basic Idea 

The idea to find the secondary occurrences is to locate all sources (of the LZ parsing)
covering the occurrence and then map their corresponding phrases to real text
positions. To do this we use another bitmap, called the bitmap of sources Bs. The
bitmap is built by first writing in unary the number of empty sources (ε) and then,
for each position of the text, writing in unary how many sources start at that position.
In this way each 1 corresponds to a source and a 0 represents the position where
the sources (1s) immediately preceding it start. Figure 5.5 shows the sources and
the corresponding phrases they generate (except the empty sources), and below the
resulting bitmap. Since there are 3 empty sources the bitmap starts with 1110, then
there are 5 sources starting at position 1, hence 111110 follows, then just one source
starting at position 2, adding 10, and finally one 0 for each remaining position.

Additionally, we need a permutation Ps connecting the 1s in the bitmap B of
phrases (recall Section 5.2.5) to the 1s in the bitmap Bs of sources. The sources
starting at a given position are sorted by increasing length, thus the last 1 before a
0 marks the longest source starting at that position. An example is given in Figure
5.6. This permutation replaces the array source of Section 4.2.1.
Bs = 1110111110100000000000000000000

Figure 5.5: Marking sources on bitmap Bs
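
The construction of the bitmap of sources can be sketched as follows. The function name and the input format are assumptions of this example: each phrase contributes one (start, length) pair for its source, with 1-based start positions and (0, 0) standing for an empty source; the pairs listed are the sources of the running example, with every source 'a' taken at position 1 as in the figure.

```python
def build_sources_bitmap(n, sources):
    """Build the bitmap of sources for a text of length n."""
    # starts_at[p] counts the sources starting at text position p;
    # slot 0 counts the empty sources.
    starts_at = [0] * (n + 1)
    for start, length in sources:
        starts_at[start if length > 0 else 0] += 1
    # One unary code (count ones followed by a zero) per slot.
    return "".join("1" * count + "0" for count in starts_at)

# Sources of phrases a, l, ab, ar, _, a_, la_, alabard, a$ (trailing character excluded).
sources = [(0, 0), (0, 0), (1, 1), (1, 1), (0, 0), (1, 1), (2, 2), (1, 6), (1, 1)]
Bs = build_sources_bitmap(21, sources)
print(Bs)
```

The result has one 1 per phrase and one 0 per text position plus the 0 that closes the empty sources, so its length is n + n' + 1.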

sources  Bs = 1110111110100000000000000000000
phrases  B  = 110101101001000000101

Figure 5.6: Permutation Ps connecting bitmap of phrases B (bottom) and bitmap of
sources Bs (top)

For each occurrence found, we find the position pos of the 0 corresponding to its
starting position in the bitmap of sources. Then we consider all the 1s to the left
of pos. We convert each source to its target phrase, compute its length and see if
the source covers the occurrence. If so, we report it as a secondary occurrence and
recursively generate all secondary occurrences from this new occurrence. In case the
source does not cover the occurrence, we stop the process and continue processing
the remaining occurrences. The algorithm is depicted in Figure 5.7.

secondaryOcc(start, len)
    pos ← select_0(Bs, start + 1)
    source_id ← pos - start - 1
    while source_id > 0 do
        phrase_id ← Ps^-1(source_id)
        source_start ← select_1(Bs, source_id) - source_id
        if source_start + len(phrase_id) > start + len then
            occ_pos ← first_pos(phrase_id) + start - source_start
            report occ_pos
            secondaryOcc(occ_pos, len)
        else
            return
        source_id ← source_id - 1

Figure 5.7: Searching for secondary occurrences from T[start, start + len] (preliminary
version)

Example 5.8. Consider the only primary occurrence of the pattern 'la' starting
at position 2. We find the third 0 in the bitmap of sources at position 12. Then we
consider all ones starting from position 11 to the left. The first 1 at position 11 maps
to a phrase of length 2 that covers the occurrence, hence we report an occurrence at
position 10. The second 1 maps to a phrase of length 6 that also covers the occurrence,
thus we report another occurrence at position 14. The third 1 maps to a phrase of
length 1, hence it does not cover the occurrence and we stop. We proceed recursively
for the secondary occurrences found at position 10 and 14.
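
The preliminary algorithm of Figure 5.7 can be traced with the following sketch. Several things here are assumptions made for illustration: `select` is a naive linear scan, `SOURCE_TO_PHRASE` plays the role of the inverse permutation with the four sources 'a' ordered by phrase id, and `phrase_len` counts the whole phrase including its trailing character, which is what makes the strict comparison correct.

```python
BS = "1110" + "111110" + "10" + "0" * 19   # bitmap of sources (Figure 5.5)
B = "110101101001000000101"                # bitmap of phrases
# SOURCE_TO_PHRASE[j]: phrase generated by the j-th source (slot 0 unused).
SOURCE_TO_PHRASE = [0, 1, 2, 5, 3, 4, 6, 9, 8, 7]

def select(bits, symbol, j):
    """1-based position of the j-th occurrence of symbol, or 0 when j is 0."""
    pos = 0
    for _ in range(j):
        pos = bits.index(symbol, pos) + 1
    return pos

def first_pos(phrase_id):
    return select(B, "1", phrase_id - 1) + 1

def phrase_len(phrase_id):
    return select(B, "1", phrase_id) - select(B, "1", phrase_id - 1)

def secondary_occ(start, length, out):
    pos = select(BS, "0", start + 1)
    source_id = pos - start - 1            # number of sources starting at or before start
    while source_id > 0:
        phrase_id = SOURCE_TO_PHRASE[source_id]
        source_start = select(BS, "1", source_id) - source_id
        if source_start + phrase_len(phrase_id) > start + length:
            occ = first_pos(phrase_id) + start - source_start
            out.append(occ)
            secondary_occ(occ, length, out)
        else:
            return                         # preliminary version: stop at the first miss
        source_id -= 1

found = []
secondary_occ(2, 2, found)    # seed: 'la' at position 2
print(found)                  # [10, 14]

missed = []
secondary_occ(4, 2, missed)   # seed: 'ba' at position 4
print(missed)                 # []
```

The first call reports the copies of 'la' inside the phrases 'la_' and 'alabard' (position 14 is where 'la' starts within 'alabard'). The second call returns nothing although 'ba' also occurs at position 16, which is the limitation of the preliminary version.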
Remark 5.9. The method explained above is just introductory, as it does not work
for general LZ77-like parsings. It only works for parsings in which no source strictly
contains another source. It is easy to see that if a source S2 is strictly covered by
another S1 some secondary occurrences are lost. Let M be a match of the pattern
sought and let M be between the rightmost positions of S2 and S1. Then, as S2 is the
first source to the left of M, we test if it covers M, stopping the process. However,
S1 does cover M and produces a secondary occurrence, which was not detected by
the algorithm presented above.

Example 5.10. Let us start with the primary occurrence of the pattern 'ba' starting
at position 4. The first source to the left is 'la', at position 2 and of length 2,
which does not cover the pattern. Hence, the algorithm explained above would stop,
reporting no secondary occurrences. However, to the left of this source is the source
'alabar' that does cover the pattern and generates the secondary occurrence starting
at position 16.

5.3.2 Complete Solution 

Kärkkäinen in his thesis [Kar99] proposes a method for converting the LZ77 parsing
into one in which no source contains another. However, we decided not to use it as
it increases excessively the number of phrases. Recall that our index will use space
proportional to the number of phrases of the parsing, thus any increase in the number
of phrases affects directly the final size of the index.

Another proposal of Kärkkäinen is to separate the sources by levels, so that within
a level no source strictly contains another, and then apply the method explained in
Section 5.3.1 within each level.

Definition 5.11. The depth of a source s is defined as

    depth(s) = 0                                        if cover(s) = ∅,
    depth(s) = 1 + max { depth(s') : s' ∈ cover(s) }    otherwise,

where cover(s) is the set of all sources containing the source s. Let S1, S2 be two
sources starting at p1, p2 and of lengths l1, l2. S1 is said to cover S2 if p1 < p2 and
p1 + l1 ≥ p2 + l2. Note that, by definition, sources starting at the same position are
not covered by each other. However, sources ending at the same position may cover
each other. For s = ε we define depth(s) = 0.

Figure 5.8 shows the additional array storing the depths of each source. The four
sources 'a' and the source 'alabar' have depth equal to 0 as all of them start at the
same position. Source 'la' has depth 1, as it is contained by the source 'alabar'.
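
The depths can be computed directly from the covering relation. The sketch below assumes that a source of depth 0 is one that no other source covers and that the depth otherwise is one more than the largest depth among the covering sources; the quadratic scan and the names `covers` and `source_depths` are illustrative choices, and (0, 0) again stands for an empty source.

```python
def covers(a, b):
    """True if source a = (p1, l1) covers source b = (p2, l2)."""
    (p1, l1), (p2, l2) = a, b
    return p1 < p2 and p1 + l1 >= p2 + l2

def source_depths(sources):
    """Depth of every source, given as (start, length) pairs."""
    distinct = set(sources)
    memo = {}

    def depth_of(s):
        if s not in memo:
            above = [t for t in distinct if covers(t, s)]
            # A covering source starts strictly earlier, so the recursion terminates.
            memo[s] = 1 + max(depth_of(t) for t in above) if above else 0
        return memo[s]

    return [depth_of(s) for s in sources]

# Sources in the order of the bitmap of sources: three empty ones,
# four copies of 'a' and 'alabar' at position 1, then 'la' at position 2.
sources = [(0, 0)] * 3 + [(1, 1)] * 4 + [(1, 6), (2, 2)]
print(source_depths(sources))   # [0, 0, 0, 0, 0, 0, 0, 0, 1]
```

Only 'la' is covered (by 'alabar'), so it is the only source with depth 1, in agreement with the description of Figure 5.8.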
depths     = 0 0 0  0 0 0 0 0  1
sources Bs = 1110111110100000000000000000000
phrases B  = 110101101001000000101

Figure 5.8: The depth of the sources for the string 'alabar_a_la_alabarda$'



The process now is similar to the idea presented earlier; however, now when we
find a source not covering the occurrence we look for its depth d and then consider
to the left only sources with depth d' < d, as those at depth ≥ d are guaranteed
not to contain the occurrence. This process works because in each level it holds that
sources to the left will end earlier than the current source, because of the definition
of depth. Moreover, sources at higher depths to the left will also end earlier as they
are contained in a source of the current depth to the left.

Now the total running time to find all occ secondary occurrences given a seed
occurrence is Ω((1/ε) occ · L) and O((1/ε) occ · L + D), where ε is the parameter for
computing the inverse permutation Ps^-1 (Section 2.7), L is the time to find the next
element to the left with depth less than a given value (an operation we consider next),
and D is the maximum depth. The additional O(D) cost is because in the worst case after
finding the last occurrence we will be in a source of depth D, then move to a source of
depth D - 1, that does not yield an occurrence, and so on up to a source of depth 1.



5.3.3 Prev-Less Data Structure 

As explained above, we need to be able to, given a position pos in an array U and a
value v, find the rightmost position p preceding pos for which it holds U[p] < v. We
will call this query prevLess(U, pos, v).

To solve this query we will encode U (i.e., the array of depths of Section 5.3.2)
using a wavelet tree (Section 2.6) supporting this additional operation. The algorithm
descends according to the bits of value v. If the value v gets mapped to a 0 we
recursively search in the left subtree. If the value v gets mapped to a 1 we recursively
search in the right subtree. In this case the answer could also be at the left side, so we
look for the rightmost 0 preceding pos in the bitmap of the wavelet tree node. Finding
this 0 takes constant time using rank and select. Finally, we return the maximum of
the value returned by the right subtree and the rightmost zero.

The pseudocode of the algorithm is presented in Figure 5.9. The algorithm receives
as parameters a wavelet tree tree, a position pos, and a value v, and returns
prevLess(array(tree), pos, v), where array(tree) are the values represented by the
wavelet tree. The bitmap of the wavelet tree is denoted tree.B. Function toBit
returns to which side the value goes, and its output depends on the level.

prevLess(tree, pos, v)
    // toBit depends on the level
    if toBit(v, tree) = 0 then
        lpos ← prevLess(tree.left, rank_0(tree.B, pos), v)
        return select_0(tree.B, lpos)
    else
        // rightmost zero
        rm0 ← select_0(tree.B, rank_0(tree.B, pos))
        lpos ← prevLess(tree.right, rank_1(tree.B, pos), v)
        return max{rm0, select_1(tree.B, lpos)}

Figure 5.9: PrevLess algorithm

As the algorithm just performs constant-time operations at each level, its total
running time is L = O(log D).
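
The descent of Figure 5.9 can be sketched on a small pointer-based wavelet tree. This is an illustration under explicit assumptions, not the index's implementation: rank and select are answered from precomputed Python lists, positions are 0-based with an exclusive bound `end`, and the query returns the rightmost position whose value is at most v, which is how Figure 5.10 uses it when it passes depth - 1; a strict comparison is obtained by calling it with v - 1.

```python
class WaveletTree:
    """Wavelet tree over small non-negative integers (0-based positions)."""

    def __init__(self, values, lo=0, hi=None):
        if hi is None:
            hi = max(values, default=0)
        self.lo, self.hi = lo, hi
        if lo == hi or not values:
            return                         # leaf or empty node: no bitmap needed
        mid = (lo + hi) // 2
        bits = [int(x > mid) for x in values]
        self.ones = [0]                    # ones[i]: number of 1s in the first i bits
        for b in bits:
            self.ones.append(self.ones[-1] + b)
        self.pos0 = [i for i, b in enumerate(bits) if b == 0]   # select_0 table
        self.pos1 = [i for i, b in enumerate(bits) if b == 1]   # select_1 table
        self.left = WaveletTree([x for x in values if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in values if x > mid], mid + 1, hi)

    def prev_leq(self, end, v):
        """Largest position p < end whose value is <= v, or -1 if there is none."""
        if end <= 0 or v < self.lo:
            return -1
        if v >= self.hi:
            return end - 1                 # every value in this node qualifies
        mid = (self.lo + self.hi) // 2
        n1 = self.ones[end]
        n0 = end - n1
        if v <= mid:                       # v maps to bit 0: search the left child only
            p = self.left.prev_leq(n0, v)
            return self.pos0[p] if p >= 0 else -1
        best = self.pos0[n0 - 1] if n0 > 0 else -1   # rightmost zero before end
        p = self.right.prev_leq(n1, v)     # v maps to bit 1: also search the right child
        if p >= 0:
            best = max(best, self.pos1[p])
        return best

U = [2, 0, 1, 3, 1, 2, 0, 3]
tree = WaveletTree(U)
print(tree.prev_leq(8, 0))   # 6
print(tree.prev_leq(4, 1))   # 2
```

Each level does a constant number of table lookups, so the number of steps is bounded by the height of the tree, in line with the O(log D) bound.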

If we label each source with its depth, and label the changes to the next text
position with D + 1, we can get rid of the original bitmap of sources. Since the
wavelet tree also supports rank and select queries we have the same functionality as
the bitmap of sources, yet with the ability to answer prevLess queries. However, as in
practice the bitmap of sources is very sparse, we preferred to use a δ-encoded bitmap
to represent it and the wavelet tree for the depths.

Using this operation we can now modify algorithm secondaryOcc of Figure 5.7. 
We keep track of the maximum depth d for which there may be sources covering the 
occurrence. When a source does not cover the occurrence, we update the value of 
d. Using the operation prevLess, we move to the next candidate source. The final 
algorithm is presented in Figure 5.10. 
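
A sketch of the depth-aware algorithm on the running example follows. As before, this is illustrative rather than the actual implementation: `select` and `prev_leq` are linear scans standing in for the bitmap and wavelet tree operations, `SOURCE_TO_PHRASE` stands in for the inverse permutation, and `prev_leq` returns the nearest source to the left whose depth is at most d, which is the reading under which passing depth - 1 selects only shallower sources.

```python
BS = "1110" + "111110" + "10" + "0" * 19   # bitmap of sources
B = "110101101001000000101"                # bitmap of phrases
SOURCE_TO_PHRASE = [0, 1, 2, 5, 3, 4, 6, 9, 8, 7]   # slot 0 unused
DEPTH = [None, 0, 0, 0, 0, 0, 0, 0, 0, 1]           # DEPTH[j]: depth of source j
MAX_DEPTH = 1

def select(bits, symbol, j):
    """1-based position of the j-th occurrence of symbol, or 0 when j is 0."""
    pos = 0
    for _ in range(j):
        pos = bits.index(symbol, pos) + 1
    return pos

def prev_leq(depths, source_id, d):
    """Nearest source to the left of source_id with depth <= d, or 0 if none."""
    for j in range(source_id - 1, 0, -1):
        if depths[j] <= d:
            return j
    return 0

def secondary_occ(start, length, out):
    pos = select(BS, "0", start + 1)
    source_id = pos - start - 1
    d = MAX_DEPTH                          # deepest level that may still cover
    while source_id > 0:
        phrase_id = SOURCE_TO_PHRASE[source_id]
        source_start = select(BS, "1", source_id) - source_id
        phrase_start = select(B, "1", phrase_id - 1) + 1
        phrase_len = select(B, "1", phrase_id) - phrase_start + 1
        if source_start + phrase_len > start + length:
            occ = phrase_start + start - source_start
            out.append(occ)
            secondary_occ(occ, length, out)
        else:
            d = DEPTH[source_id] - 1       # only shallower sources can still cover
            if d < 0:
                return
        source_id = prev_leq(DEPTH, source_id, d)

found = []
secondary_occ(4, 2, found)    # seed: 'ba' at position 4
print(found)                  # [16]
```

Starting from 'ba' at position 4, the source 'la' fails, the search moves to depth 0 and reaches 'alabar', which yields the occurrence at position 16 that the preliminary version missed.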

5.4 Query Time 

Combining all the steps gives us the total running time to find the occurrences. 
secondaryOcc(start, len)
    pos ← select_0(Bs, start + 1)
    source_id ← pos - start - 1
    // D is the maximum depth
    d ← D
    while source_id > 0 do
        phrase_id ← Ps^-1(source_id)
        source_start ← select_1(Bs, source_id) - source_id
        if source_start + len(phrase_id) > start + len then
            occ_pos ← first_pos(phrase_id) + start - source_start
            report occ_pos
            secondaryOcc(occ_pos, len)
        else
            d ← depth[source_id] - 1
            if d < 0 then
                return
        source_id ← prevLess(depth_tree, source_id, d)

Figure 5.10: Searching for secondary occurrences from T[start, start + len]

• Primary Occurrences: the total time is O(m(Find_suf + Find_rev + log n' +
Extract_m) + occ1 log n'), where Find_suf is the time to search for a subpattern
of length m in the suffix trie, Find_rev is the time to search for a subpattern
of length m in the reverse trie, and occ1 is the number of primary occurrences.
Thus the time to count the occurrences in the range structure is
O(log n'), the total time to locate the primary occurrences in the range structure
is O(occ1 log n'), and Extract_m is the time to extract m characters to
verify the PATRICIA searches. Extract_m, as said in Section 4.2.2, depends
on the parsing and is O(mH) for LZ77 and O(m + H) for LZ-End in the worst
case. As our experiments show later (Section 6.1), in practice the difference is
not as drastic: LZ77 is about 3 times (for most texts) slower for long substrings
and not much slower for short substrings.

The Find times depend on the structures used: 

- Tries: $O(\mathit{Find}^{st}_m) = O(\mathit{Find}^{rev}_m) = O(m + \mathit{Skips}) = O(m + mH) = O(mH)$
(as the time to compute all skips is $O(m \cdot \mathit{Extract}_1)$ in the worst case).

- Tries+Skips: $O(\mathit{Find}^{st}_m) = O(\mathit{Find}^{rev}_m) = O(m + \mathit{Skips}) = O(m)$ (as the
skips are stored).

- Binary Search: $O(\mathit{Find}^{st}_m) = O(\log n' \cdot \mathit{Extract}_m)$ if we store the
array of ids explicitly, otherwise the time is
$O(\mathit{Find}^{st}_m) = O(\log n' (\log n' + \mathit{Extract}_m))$.
$O(\mathit{Find}^{rev}_m) = O(\log n' \cdot \mathit{Extract}_m)$ on LZ77, and $O(m \log n')$
on LZ-End (as the extraction takes constant time per extracted symbol in
this case). Here we save the verification of PATRICIA trees but this has
no effect on the total complexity.

With this the total time using tries is $O(m^2 H + m \log n' + occ_1 \log n')$,
independent of the parsing. When adding skips the time drops to
$O(m^2 + mH + m \log n' + occ_1 \log n')$ on LZ-End. When using binary searching
instead, the time is $O(m(m + H) \log n' + occ_1 \log n')$ for LZ-End and
$O(m^2 H \log n' + occ_1 \log n')$ for LZ77 if we store the id array explicitly;
otherwise the time increases to $O(m(\log n' + m + H) \log n' + occ_1 \log n')$ for
LZ-End and $O(m(\log n' + mH) \log n' + occ_1 \log n')$ for LZ77.

• Secondary Occurrences: the total time is $O(\frac{1}{\epsilon}\, occ\, (\log D + D))$, where $D$
is the maximum depth and $\epsilon$ is the parameter for the representation of the
permutation (Section 2.7). Recall from Sections 5.3.1 and 5.3.2 that the time
to find the secondary occurrences from a seed is $O(\frac{1}{\epsilon}\, occ \log D + D)$. However,
in this case we are recursively locating the secondary occurrences from all the
occurrences found, and in the worst case we could pay $O(D)$ for each occurrence
without finding new ones.

Taking $\epsilon = \frac{1}{\log n'}$ gives us a total time similar to the one given for the primary
occurrences, except that $occ_1 \log n'$ changes to $occ \cdot D \log n'$.

To solve exists queries, we basically search for the first primary occurrence. Hence
the total time is as given for the primary occurrences, with the $occ_1 \log n'$
reporting term dropped (the details can be seen in Table 5.1).
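
The binary-search alternative for the suffix side can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the text is held uncompressed, `extract` is plain slicing standing in for the index's extraction operation (so it does not reproduce the $\mathit{Extract}_m$ cost), and `sorted_starts` is a hypothetical name for the explicitly stored array of ids, given here as phrase starting positions in lexicographic order of their suffixes. Each of the $O(\log n')$ comparisons extracts at most $m$ characters.

```python
def extract(text, start, length):
    """Stand-in for the index's extract operation (assumption: plain slicing)."""
    return text[start:start + length]


def suffix_range(text, sorted_starts, pattern):
    """Return the half-open range [lo, hi) of entries in sorted_starts whose
    suffix text[start:] begins with pattern, using two binary searches."""
    m = len(pattern)

    # Lower bound: first entry whose m-character prefix is >= pattern.
    lo, hi = 0, len(sorted_starts)
    while lo < hi:
        mid = (lo + hi) // 2
        if extract(text, sorted_starts[mid], m) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first entry whose m-character prefix is > pattern.
    hi = len(sorted_starts)
    while lo < hi:
        mid = (lo + hi) // 2
        if extract(text, sorted_starts[mid], m) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


if __name__ == "__main__":
    text = "abracadabra"
    # Hypothetical phrase starting positions, sorted by the suffix they begin.
    starts = [0, 3, 5, 7, 10]
    sorted_starts = sorted(starts, key=lambda p: text[p:])
    print(sorted_starts, suffix_range(text, sorted_starts, "abr"))
```

An empty range (equal endpoints) means that no phrase-aligned suffix starts with the subpattern, so that partition of the pattern yields no primary occurrence.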

5.5 Construction 

In this section we explain the construction algorithm of the proposed index. We
propose a practical construction algorithm, with bounded space usage and reasonable
running times. (See Table 5.2 for a reminder of the definitions of the variables.)

1. Alphabet mapping: as we work with standard texts that represent each 
symbol using 1 byte, we map the byte values to effective alphabet positions and 
vice versa. 
2. LZ parsing: For LZ77 we use the algorithm CPS2 of Chen et al. [CPS08] and
for LZ-End we use the algorithm described in Section 4.4. At this stage we
generate three different files containing the trailing characters, the lengths of
the sources, and the starting positions of the sources. Using suffix trees
(Section 2.10), the LZ77 parsing can be computed in $O(n)$ time using $O(n)$ words of
space, and the LZ-End parsing in $O(N)$ time using the same space. Additionally,
the LZ77 parsing can be computed theoretically in $O(n \log n (\log^{\epsilon} n + o(\log \sigma)))$
time using $n(H_k(T) + 2) + o(n \log \sigma)$ bits and the LZ-End parsing in time
$O(\log n (N \log^{\epsilon} n + n \cdot o(\log \sigma)))$ using $n(H_k(T) + 2) + n' \log n + o(n \log \sigma)$ bits
(Section 4.4). The practical algorithms take total time $O(n \log n)$ for LZ77 and
$O(N \log n)$ for LZ-End (recall Section 4.4). The space usage is around $5.7n$
bytes for LZ77 and $9n$ bytes for LZ-End, and this is the peak space usage for
the self-index construction. In the index we only store explicitly the trailing
characters using $n' \log \sigma$ bits.

3. Bitmap of phrases: this bitmap is easily computed from the array containing
the lengths of the sources in time $O(n')$. It uses $n' \log \frac{n}{n'} + O(n' + \frac{n \log \log n}{\log n})$
bits (Section 2.5). In practice we use (5-encoded bitmaps, using n'log^ + 
0(n'loglogn -|- ) bits, where s is the sampling step. All query times 
are then multiplied by s -|- log y. For the analysis we will assume s — logn'. 

4. Suffix trie and reverse trie: for constructing these trees, we decided to 
insert all indexed substrings in a PATRICIA trie. This is 0{n) time for the 
reverse trie, but it could be quadratic for the suffix trie (there are complex 
0(n)-time algorithms for building suffix tries [IT06]). In practice this does not 
happen and the running time is good, as the number of phrases n' generated 
by the parsing is relatively small. Of course we insert and store pointers to the 
text in the trie, rather than the whole strings. From the PATRICIA trie, we can 
extract the sorted ids, the skips and the DFUDS representation of the tree in 
time $O(n')$. Each tree will have at most $2n'$ nodes, hence they require $4n' + o(n')$
bits (Section 2.8) for the topology of the tree, plus $2n'$ label characters encoded
using $2n' \log \sigma$ bits. Additionally, the rev_ids are stored using $n' \log n'$ bits.

5. Range structure: to build the range structure we need a permutation from 
the ids of the suffix tries to the ids of the reverse trie. This is done in 0{n') 
time, inverting the permutation of the ids and then traversing the rev_ids and 
assigning each to the corresponding id. Then, the range structure is built
starting from the permutation in time $O(n' \log n')$. It uses $n' \log n' + O(n' \log \log n')$
bits (Section 2.6.1).

6. Sources depths: to compute the structures related to secondary occurrences
we first need to compute the depth of each source. First we sort all sources
by increasing starting position, breaking ties by decreasing length. Doing this
we know that all parents of a source are to its left. We keep track of the
rightmost source of depth $d$ for each possible depth. Then for each source we
binary search the rightmost sources and find the deepest one, of depth $d$, that
covers the current phrase. Afterward, we set the current source as the rightmost
source of depth $d + 1$. The running time of the algorithm is $O(n' \log n')$.

7. Prev-Less Depth Structure: this structure is constructed in 0{n' log D) 
time as it is just a wavelet tree It uses n'logD + 0(n' log log bits (Section 
5.3.3). 

8. Source-Phrase Permutation: it takes $O(n')$ time starting from the ids of
the sorted sources. It is stored using $(1 + \epsilon) n' \log n' + O(n')$ bits (Section 2.7),
and as we set $\epsilon = \frac{1}{\log n'}$, the total space is $n' \log n' + O(n')$ bits.

9. Bitmap of Sources: it takes $O(n')$ time to build from the starting positions
of the sorted sources. It uses $n' \log \frac{n}{n'} + O(n' + \frac{n \log \log n}{\log n})$ bits (Section 2.5). In
practice we use $\delta$-encoded bitmaps, so the same considerations as for the bitmap
of phrases apply.
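
The depth computation of step 6 can be sketched in Python as follows. This is a minimal sketch, not the thesis implementation, and it rests on several assumptions that are ours: sources are given as hypothetical `(start, length)` pairs denoting half-open intervals, a source that no other source covers gets depth 1, and of two identical sources the one sorted first counts as covering the other. Because sources are processed by increasing start, an earlier source covers the current one exactly when its end is at least as large, and the ends of the rightmost sources per depth are non-increasing in the depth, which is what makes the binary search valid.

```python
from bisect import bisect_right


def source_depths(sources):
    """Return the nesting depth of every (start, length) source.

    Sources are processed by increasing start, ties broken by decreasing
    length, so every source covering the current one was already processed.
    """
    order = sorted(range(len(sources)),
                   key=lambda i: (sources[i][0], -sources[i][1]))
    # neg_ends[d - 1] holds minus the end of the rightmost source of depth d.
    # Ends are non-increasing in d, so the negated list is sorted ascending.
    neg_ends = []
    depths = [0] * len(sources)
    for i in order:
        start, length = sources[i]
        end = start + length
        # Number of depths whose rightmost source covers the current one.
        d = bisect_right(neg_ends, -end)
        if d == len(neg_ends):
            neg_ends.append(-end)
        else:
            neg_ends[d] = -end
        depths[i] = d + 1
    return depths


if __name__ == "__main__":
    example = [(0, 10), (2, 3), (2, 2), (4, 6), (12, 2)]
    print(source_depths(example))
```

Each source costs one binary search over at most $D$ entries, which is within the $O(n' \log n')$ bound stated in step 6 once the initial sort is included.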

Adding up the space of all structures we get that the index requires
$2n' \log n + n' \log n' + n' \log D + O(n' \log \sigma + \frac{n \log \log n}{\log n})$ bits of space, which in our
practical implementation is $2n' \log n + n' \log n' + n' \log D + O(n' \log \sigma + n' \log \log n)$
bits plus the skips we store. Note that in the case of binary searching we do not
use tries, yet the asymptotic space complexity is not reduced.

Note that our practical index space is fully proportional to n', depending on n 
only logarithmically. 
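
As a worked illustration of this observation, the following Python sketch evaluates only the three leading terms $2n' \log n + n' \log n' + n' \log D$ of the practical space bound. The function name and the sample values are hypothetical, logarithms are taken in base 2, and the lower-order $O(\cdot)$ terms and the skips are deliberately ignored, so the result is a rough estimate rather than the size of an actual index.

```python
from math import log2


def index_space_leading_bits(n, n_prime, max_depth):
    """Leading terms of the index space in bits:
    2 n' log n + n' log n' + n' log D (lower-order terms ignored)."""
    return (2 * n_prime * log2(n)
            + n_prime * log2(n_prime)
            + n_prime * log2(max_depth))


if __name__ == "__main__":
    # Hypothetical collection: text of 2^20 symbols, 2^10 phrases, depth 16.
    small = index_space_leading_bits(2**20, 2**10, 2**4)
    # Same parsing size, but a text 1024 times longer.
    large = index_space_leading_bits(2**30, 2**10, 2**4)
    print(small, large, large / small)
```

With the number of phrases fixed, lengthening the text only changes the $\log n$ factor of the first term, which is the logarithmic dependence on $n$ noted above.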

For the construction time and space of the index we have given practical figures.
We now give two trade-offs for the theoretical upper bounds. The first, Theory1,
uses the least possible construction time, and the second, Theory2, the least possible
construction space.

• Theory1: The space gets dominated by the $O(n \log n)$ bits needed to build
the parsing. All construction times are $O(n' \log n')$, except the parsing and
creating the PATRICIA trees. Hence, the index is built in time $O(n + n' \log n')$
($O(N + n' \log n')$ for LZ-End) using $O(n \log n)$ bits.

• Theory2: In Section 4.4 we showed that the LZ77 parsing can be computed in
$O(n \log n (\log^{\epsilon} n + o(\log \sigma)))$ time using $O(nH_k(T)) + o(n \log \sigma)$ bits and the
LZ-End parsing in time $O(\log n (N \log^{\epsilon} n + n \cdot o(\log \sigma)))$ using the same space. All
data structures (except PATRICIA trees) are constructed as explained above,
each structure requiring at most $O(n' \log n) = O(nH_k(T)) + o(n \log \sigma)$ bits (recall
Lemma 4.13) and $O(n' \log n')$ time. In the following we show that we can build
the PATRICIA trees in time O (n' \og^^'^ n) using 0{nHk(T) + o(n log a)) bits. 
The idea is similar to that presented by Claude and Navarro [CNIO] enhanced 
with some ideas from Russo et al. [RNO08] 

1) First we build the FM-index [FM05] of T in C>(nlognj^||^) time within 
nHk{T) + o(nlogo") bits of space. 2) Then, we build the Fully-Comprcsscd Suf- 
fix Tree (FCST) [RNO08], that supports all tree operations in 0(\og^^^ n) time. 
For building the FCST, we simulate a traversal of the suffix tree starting from
the root using Weiner links [Wei73], which are simulated using the LF mapping
(see Section 2.12) over the FM-Index. During the traversal we mark all
nodes that are at depth multiple of $\delta$ in the implicit tree defined by the Weiner
links, where $\delta$ is the space/time trade-off parameter of the FCST (which takes
o(nlogo') bits of space for 5 = log^^'^n). These nodes are stored in a simple 
array using o{n log a) bits with all the information required by the FCST con- 
struction algorithm. The running time of this process is 0{n ^^°fj^_^ ) and the 
space is $o(n \log \sigma)$ bits. 3) We use a dynamic balanced tree to mark some of
the nodes of the FCST; these will be the nodes of our PATRICIA tree. For
each phrase starting position we convert it using $A^{-1}$ to an FM-Index position,
and then selectLeaf (which gives the $i$-th leaf) converts it to a position in the
FCST. We mark in the balanced tree the preorder of the FCST node, as well as
the phrase id. Then, we traverse the balanced tree from left to right computing
$LCA(x_i, x_{i+1})$ (where $x_i$ is the current node and $x_{i+1}$ is the node to the right,
and LCA is their lowest common ancestor in the FCST) and inserting the value
in the tree. To build the PATRICIA tree, we traverse the balanced tree again
from left to right, creating the PATRICIA nodes, generating the parentheses
representation and the labels of the edges. Using the operation letter of the
FCST (which gives any letter of the path leading to a node) we retrieve the
label. Using the FCST we determine the topology of the tree (we keep the
current PATRICIA path in a stack; add closing parentheses and pop the stack
until the top of the stack is an ancestor of the new node; and then we add the
opening parenthesis for the current node and push it to the stack). This step
runs in 0{n' \og^^^ n) time and within O(n'logn) = 0{nH}.(T)) -\- o{n\oga) 
(recall Lemma 4.13) bits of space. 

Hence, the total time is 0{n\ogn ^^°f^ ^ -\- n'log^'''^n) and the total space is 






$O(nH_k(T)) + o(n \log \sigma)$. The process for constructing the reverse trie is almost
the same, but now we do not consider the whole suffixes, because they are limited
by the phrase length. Given $A^{-1}(pos)$, we use the operation $LAQ_S(d)$ of
the FCST (which retrieves the ancestor with string depth $d$) to find which node
we need to mark.

As the time for constructing the PATRICIA trees gets dominated by the parsing
algorithm, the total space required is $O(nH_k(T)) + o(n \log \sigma)$ and the total
running time is $O(n \log n (\log^{\epsilon} n + o(\log \sigma)))$ for LZ77 and
$O(\log n (N \log^{\epsilon} n + n \cdot o(\log \sigma)))$ for LZ-End.

5.6 Summary 

We have presented a self-index that, given a text of length $n$ parsed into $n'$ different
phrases by a Lempel-Ziv-like parsing, uses space proportional to that of the
compressed text, i.e., $O(n' \log n) + o(n)$ bits. Table 5.1 summarizes the space and time
of the operations over the index and Table 5.2 summarizes all the parameters of the
index. In practice, due to our sparse bitmap representation, the $o(n)$ bits disappear
from the space but the times of extract, exists and locate are multiplied by $O(\log n')$.
Construction Time (Tries and Binary Search)
    Theory1: $O(n + n' \log n')$ for LZ77 and $O(N + n' \log n')$ for LZ-End
    Theory2: $O(n \log n (\log^{\epsilon} n + o(\log \sigma)))$ for LZ77 and
             $O(\log n (N \log^{\epsilon} n + n \cdot o(\log \sigma)))$ for LZ-End
    Practice: $O(n \log n)$ for LZ77, $O(N \log n)$ for LZ-End

Construction Space (Tries and Binary Search)
    Theory1: $O(n \log n)$ bits
    Theory2: $O(nH_k(T)) + o(n \log \sigma)$ bits
    Practice: LZ77 ~ $6n$ bytes, LZ-End ~ $9n$ bytes

Index Space (Tries and Binary Search)
    Theory: $2n' \log n + n' \log n' + n' \log D + O(n' \log \sigma + \frac{n \log \log n}{\log n})$ bits
    Practice: $2n' \log n + n' \log n' + n' \log D + O(n' \log \sigma + n' \log \log n)$ bits

Extract Time (Tries and Binary Search)
    LZ77: $O(mH)$, LZ-End: $O(m + H)$

Exists Time
    Tries: $O(m^2 H + m \log n')$
           With skips and LZ-End: $O(m^2 + mH + m \log n')$
    Binary Search:
           Using $n' \log n'$ additional bits: LZ77: $O(m^2 H \log n')$,
           LZ-End: $O(m(m + H) \log n')$
           Otherwise: LZ77: $O(m(\log n' + mH) \log n')$,
           LZ-End: $O(m(\log n' + m + H) \log n')$

Locate Time
    Tries: $O(m^2 H + m \log n' + occ \cdot D \log n')$
           With skips and LZ-End: $O(m^2 + mH + m \log n' + occ \cdot D \log n')$
    Binary Search:
           Using $n' \log n'$ additional bits: LZ77: $O(m^2 H \log n' + occ \cdot D \log n')$,
           LZ-End: $O(m(m + H) \log n' + occ \cdot D \log n')$
           Otherwise: LZ77: $O(m(\log n' + mH) \log n' + occ \cdot D \log n')$,
           LZ-End: $O(m(\log n' + m + H) \log n' + occ \cdot D \log n')$

Table 5.1: Summary table of LZ77-Index. Adding skips requires at most $4n' \log n$
more bits, but far less in practice. In practice times are multiplied by $O(\log n')$.



Parameter    Description                                     Defined in
$\sigma$     size of the alphabet                            Section 2.1
$n$          length of the text                              Section 2.1
$n'$         length of the LZ parsing                        Definitions 2.30 and 4.2
$m$          length of the pattern                           Section 2.2
$s$          sampling step of $\delta$-encoded bitmap        Section 2.5.2
$\epsilon$   parameter of the permutation                    Section 2.7
$D$          maximum depth of the sources                    Section 5.3.2
$H$          height of the LZ parsing                        Definition 4.5
$N$          total text retraversed in the LZ-End parsing    Section 4.4

Table 5.2: Summary table of parameters of LZ77-Index
6.1 Experimental Setup

In our tests we compared the proposed index against RLCSA [NM07]. We did not test
the SLP Index [CN09] because we could not make it run consistently on our
collections, yet some comparison results can be inferred from their experimental
evaluation [CFMPN10], as we do in Section 6.2.

We used in our experiments the LZ77 and the LZ-End parsings. For the LZ indexes
we used the following variants (defined in Section 5.2.6), ordered by decreasing space
requirement. In all variants we stored the skips of the trees using DAC (Section 2.4.1),
because not using them led to results worse than using binary search (our slowest
variant).

1. Suffix trie and reverse trie (original proposal). 

2. Binary search on ids with the explicit ids and reverse trie. 

3. Binary search on reverse ids and suffix trie. 

4. Binary search on ids with explicit ids and binary search on reverse ids. 

5. Binary search on ids with implicit ids and binary search on reverse ids. 

Recall from Section 5.2.6 that the array of the ids is not stored in the index, only
the array of reverse ids. Thus, if we want to binary search over ids we have two
alternatives: (1) spend $n' \log n'$ additional bits to store explicitly the array of ids, or
(2) use Remark 5.6 to access the array implicitly by paying $O(\log n')$ access time.
The index variants with explicit ids refer to alternative (1) and the ones with
implicit ids refer to alternative (2). The alternatives using the suffix trie do not need
to access the id array, but rather the reverse ids array, which is always maintained in
explicit form.

Remark 6.1. The reader may note that the results concerning the alternative

• Binary search on ids with implicit ids and reverse trie

are not present. We omit the empirical results of this alternative as the compression
ratio is about the same as that obtained using alternative number 3 and the
performance of locate is noticeably worse. Remember that accessing the implicit array
of ids takes time $O(\log n')$.

The parameters used for the data structures are as follows: $s = 16$ for the $\delta$-encoded
bitmap (Section 2.5.2), $\epsilon = 1/32$ for the permutation (Section 2.7) and sampling step
$b = 20$ for the bitmaps of Gonzalez et al. (Section 2.5.1). We used these parameter
values as they are the typical ones used in experimentation, and additionally with
these values our indexes achieve a good space/time trade-off.

For RLCSA we used sampling with steps 512, 256, 128 and 64. The index was
built using a buffer of 100 MiB.

All our experiments were conducted on a machine with two Intel Xeon CPUs
running at 2.00GHz with 16GiB main memory. The operating system is Ubuntu
8.04.4 LTS with Kernel 2.6.24-27-server. The compiler used was g++ (gcc version
4.2.4) executed with the -O3 optimization flag.

We present the results obtained for the following texts (the results of only one text 
from each collection are presented in this section, the remaining results are presented 
in Appendix A): 

• Artificial: F41, R13, T29. 

• Pseudo-Real: DNA 0.1% (Scheme 1), Proteins 0.1% (Scheme 1), English 0.1% 
(Scheme 2), Sources 0.1% (Scheme 2). 

• Real: Para, Cere, Influenza, Escherichia Coli, Coreutils, Kernel, Einstein (en),
Einstein (de), World Leaders.
We restricted the experiments to the texts listed above since many of the texts
produced similar results during preliminary experiments. For the DNA and Wiki
domains, we chose the largest texts. For the case of pseudo-real texts we kept the DNA,
Proteins, English and Sources texts, as these kinds of texts naturally form repetitive
collections.

Two types of experiments were carried out. One, labeled "results (1)" in Figures
6.2-6.7 and A.1-A.26, considers the time of the operations as a function of the pattern
length $|P|$. The second, labeled "results (2)" in Figures 6.2-6.7 and A.1-A.26, shows
the space/time trade-off of the operations. Although we do not show a space/time
tuning for the LZ77/LZ-End index, the plots of figures labeled "results (2)" show a line
formed by 5 points. These refer to the variants 1-5 described above. The results are
presented and discussed in Section 6.1. The experiments conducted are the following:

• Construction time and space: we present the build time for each index as 
well as the peak memory usage. Results are given in Figure 6.1. 

• Compression Ratio: we present the compression ratio for different self-indexes. 
We show alternatives 1 and 5 of our indexes, which are respectively the largest 
and smallest variants. For RLCSA we show the space achieved with a sampling
step of 512, and without the samples, which is the lowest space reachable by 
that index. For the LZ78-index [ANS06] (Section 2.14.3) we used e — as the 
sampling step of the permutation. Additionally, we show the compression ratio 
of ILZI [RO08] (Section 2.14.3). The results are shown in Table 6.1, where we 
also include p7zip as a baseline. 

• Structures Space: we present the space usage of the different data structures 
used in our indexes. The results are given in Tables 6.4 and 6.5 as percentage 
of the size of the index. 

• Parsing Statistics: we present the value of D and H, which affect the perfor- 
mance of our indexes. The results are displayed in Tables 6.2 and 6.3. 

• Extraction speed: we extracted 10,000 substrings of length $2^i$, $i \in \{0, \ldots, 12\}$.
We show only one line for the LZ77 and the LZ-End index, as all the variants 
have the same extraction speed. See the top-left plots of Figures 6.2-6.7 labeled 
"results (1)" , which are representative of all the results (the rest are in Appendix 
A, in Figures A.1-A.26). We also show the space/time trade-off of the indexes 
for extracting a pattern of length 2^^. Extraction times per character stabilize 
at this length. See the top-left plot of Figures 6.2-6.7 and A.1-A.26 labeled 
"results (2)". 
• Search time: we located 1,000 patterns of length 10, 15, and 20. We limited
the number of occurrences reported to 30,000. See plots 2-4 in reading order
of Figures 6.2-6.7 and A.1-A.26 labeled "results (2)". We also located patterns
of increasing length from 5 to 40. In this case we only show the results for
alternatives 1 (original proposal) and 5 (minimum space) of both LZ77 and
LZ-End. See the top-right plot of Figures 6.2-6.7 and A.1-A.26 labeled "results (1)".

• Locate time: we located 1,000 patterns of length 2 and 4. This test highlights
the time needed to find the occurrences in our indexes, as it dominates the 
time for traversing the tries. We limited the number of occurrences reported to 
100,000. See plots 5-6 of Figures 6.2-6.7 and A.1-A.26 labeled "results (2)". 

• Exists Time: we generated 20,000 patterns of lengths 5, 10, 20, 40 and 80; half 
of them were present in the text and the other half were a random concatenation 
of symbols of the text. For RLCSA we check the existence using a count query. 
For this reason, we only show one line for RLCSA, as count time is independent 
of the sampling size. The exists query of the LZ77 index is basically a search 
of a primary occurrence, and thus it illustrates the time for traversing the tries. 
We only show the results for alternative 1 (original proposal) of both LZ77 and 
LZ-End, since the other alternatives are orders of magnitude slower for this. 
See the bottom-left plot of Figures 6.2-6.7 and A.1-A.26 labeled "results (1)" for
existing patterns and the bottom-right plot for non-existing patterns. We also show
the space/time trade-off of these two queries for patterns of length 20; see plots
7-8 of Figures 6.2-6.7 and A.1-A.26 labeled "results (2)".
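
The pattern sets for the exists experiment can be pictured with a short Python sketch. It is an illustrative reconstruction, not the scripts used for the experiments: the function name, the seed, and the reading of "random concatenation of symbols of the text" as symbols drawn uniformly from the distinct symbols of the text are all assumptions. Note that a random concatenation may still occur in the text by chance, so this sketch does not guarantee that the second half is absent.

```python
import random


def make_exists_patterns(text, length, count, seed=0):
    """Return (present, random_concat): half of `count` patterns are substrings
    of `text`, the rest are random concatenations of symbols of the text."""
    rng = random.Random(seed)
    half = count // 2

    # Patterns known to occur: substrings taken at random positions.
    present = []
    for _ in range(half):
        pos = rng.randrange(len(text) - length + 1)
        present.append(text[pos:pos + length])

    # Random concatenations over the distinct symbols appearing in the text.
    alphabet = sorted(set(text))
    random_concat = [
        "".join(rng.choice(alphabet) for _ in range(length))
        for _ in range(count - half)
    ]
    return present, random_concat


if __name__ == "__main__":
    sample_text = "abracadabra" * 50
    for m in (5, 10, 20, 40, 80):
        present, random_concat = make_exists_patterns(sample_text, m, 20)
        print(m, len(present), len(random_concat))
```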
Figure 6.1: Construction time and space for the indexes (panels: Index Construction
Time and Index Construction Space). Times are in seconds per MiB and spaces are the
ratio between construction space and text space.

Table 6.1: Compression ratio (given in percentage of original file size) of different
self-indexes. In bold are highlighted those LZ-based indexes outperforming the best
compression achievable by RLCSA.

Table 6.2: D value (i.e., maximum depth) and mean depth for the LZ indexes

Table 6.3: H value (i.e., maximum extraction cost) and mean extraction cost for the
LZ indexes

Table 6.4: Detailed space of LZ77 index structures. Values are in percentage of the
total size.

Table 6.5: Detailed space of LZ-End index structures. Values are in percentage of
the total size.

Figure 6.2: T29 results (1). Note the logscales.

Figure 6.3: T29 results (2). Note the logscales.

Figure 6.4: DNA 0.1% results (1). Note the logscales.

Figure 6.5: DNA 0.1% results (2). Note the logscales.

Figure 6.6: Kernel results (1). Note the logscales.

Figure 6.7: Kernel results (2). Note the logscales.
6.2 Analysis of the Results

It can be seen from the results presented above that our indexes are competitive with 
RLCSA and in most cases they show better space/time trade-offs. 

Figure 6.1 shows that our indexes are built more efficiently than RLCSA. The 
space needed to build the LZ77 index is about 60% of that of RLCSA, and for the case 
of the LZ-End index the space is about 85% of that of RLCSA. The construction time 
for LZ77 is about 35% of that of RLCSA, yet for LZ-End it is about 185% of that of 
RLCSA (i.e., slower). Our space occupancy during construction is a great advantage 
against RLCSA as it allows us to build the index for larger texts using the same 
resources. We could reduce construction space further by sacrificing construction 
time, recall Section 4.4. 

The compression ratio of our indexes is usually superior to that of RLCSA (see 
Table 6.1). When considering the lower bound of RLCSA, which only supports count 
and exists queries, our smallest index compresses better than RLCSA for all except 
texts influenza, coreutils and World Leaders. When considering RLCSA with a sam- 
pling step equal to 512, our compression is better for all except text coreutils. From 
now on we compare our indexes with the RLCSA with sampling step 512. The com- 
pression difference is more noticeable for artificial texts, where our compression is 
100-1000 times better than RLCSA. For DNA collections Para and Cere our best 
compression space (always achieved with LZ77) is about 45% of RLCSA's, yet it
rises to 70-80% for Influenza and Escherichia Coli. The space is also 80% on World
Leaders. On Kernel our space is 60% of RLCSA's, but they are almost the same on
Coreutils. On Wikipedia articles our space is 20-30% of that of RLCSA. LZ-End
needs more space than LZ77, losing to RLCSA in Influenza, Escherichia Coli,
Coreutils and World Leaders. For texts of pseudo-real collections the compression ratio
of LZ77 is about 60% of that of RLCSA, and even the alternative using more space 
has better compression ratios than RLCSA. It can also be seen in Figures 6.2-6.7 and 
A.1-A.26 labeled "results (2)" that in most cases the space/time trade-off is in favor 
of LZ77/LZ-End. Our LZ77 indexes use 2.63-7.52 times the space of p7zip and our 
LZ-End indexes use 4.57-23.03 times the space, depending on the type of text and 
the number of structures we use (e.g., tries). Finally, the compression ratios of LZ78- 
based indexes are not competitive at all (note that ILZI's maximal parsing performs 
better than LZ78). 

We expected to improve on the compression of RLCSA for highly repetitive texts,
since LZ77 is more powerful at detecting repetitions (see Lemma 4.1 and Section 2.14.1).
For artificial texts, the most repetitive ones, this situation is even clearer. For
such texts, most of the space of the RLCSA corresponds to the space to represent the
samplings, which are difficult to compress [MNSV10].

Tables 6.2 and 6.3 show that $D$ is generally moderate (below 42), and that the
greatest extraction cost is also moderate (at most 257 steps in LZ-End and at most
98 steps in LZ77, except for the texts of Wikipedia). When taking into account the
mean values of the depth and extraction cost, the values decrease noticeably: the
average extraction cost is below 25 steps for LZ-End and below 32 steps for LZ77
(except for the texts of Wikipedia).

Tables 6.4 and 6.5 show that the suffix trie and the reverse trie use more space
than the rest of the data structures. Then we see that the range structure, the reverse
ids and the permutation use roughly the same space (they require $n' \log n' + o(n' \log n')$
bits). For LZ-End we see that the space needed to represent both bitmaps increases
noticeably, being even higher than that of the range structure or the permutation.
Finally, we see that the skips only use 11-13% and the wavelet tree of depths uses
2-4% of the total space. Additionally, for the artificial texts, we see that more than
90% of the space is used to represent the suffix trie and the reverse trie. This is
because our implementation of DFUDS stores a boosting table of constant size 1 KiB,
and for these texts the number of phrases $n'$ is less than 100, so the space of
the table is considerably larger than the space of the remaining data structures. We
also note that our implementation stores the labels of the tree using 1 byte for each
symbol.

The extract time of the LZ-End index is better than that of RLCSA in all texts, being
at least twice as fast, and up to 10 times faster for short passages. The LZ77 index
extracts substrings of length up to 50 faster than RLCSA. When taking into account the
space/time trade-offs (top-left plot of Figures 6.2-6.7 and A.1-A.26 labeled "results
(2)") we see that our indexes improve on RLCSA both in extraction time and space,
dominating the curve defined by RLCSA, except for the texts where LZ-End loses in
space, in which case no one dominates. The extraction space/time trade-off is better
than that of RLCSA because, since RLCSA cannot compress the sampling, it has to
use a sparse sampling to be competitive in space.

The performance of locate queries is related to the pattern length. This is because
in our indexes the locating cost is quadratic in the pattern length (yet, this is the
worst case; in practice many searches are abandoned earlier). It can be seen in plots
2-6 of Figures 6.2-6.7 and A.1-A.26 labeled "results (2)" that for patterns of length
2 or 4 all of our indexes are significantly faster than RLCSA. This is because our
indexes are much faster at locating each occurrence, and this is the cost that dominates
for short patterns. However, when we increase the length of the patterns, the increase
in cost is noticeable for the alternatives using binary search, which are those using
least space. Alternative number 1, using tries, instead shows a time basically
independent of $|P|$ (except somewhat in DNA 0.1% and Escherichia Coli). This is
also seen in the top-right plot of Figures 6.2-6.7 and A.1-A.26 labeled "results (1)".
RLCSA time is almost insensitive to $|P|$, thus in several cases it becomes faster
for longer patterns (which also have fewer occurrences, for reporting which RLCSA
is slower).

By analyzing the performance results of SLPs [CFMPN10] it is clear that the
compression ratio of SLPs (at least when using Re-Pair to create the grammar) is
worse than that of RLCSA. For the case of DNA (Para, Cere and Influenza) the
compression ratio is more than twice that of the LZ77 indexes. Furthermore, the
locate time of SLPs is only comparable to RLCSA for small patterns ($|P| < 6$), in
which case our indexes show a space/time trade-off much better than that of RLCSA.

Finally, exists queries are solved consistently faster by RLCSA than
by our indexes. Looking at plots 3 and 4 of Figures 6.2-6.7 and A.1-A.26 labeled
"results (1)" we notice that our larger variants are comparable to RLCSA, although
always slower, in the case of patterns present in the text. The difference widens in
the case of non-existent patterns, as RLCSA improves more sharply. Moreover, in
our indexes the time increases with the length, as opposed to RLCSA, where the time
is practically constant when the pattern does not exist. In plots 7 and 8 of Figures
6.2-6.7 and A.1-A.26 labeled "results (2)" one can see the trade-off of exists queries.
They show that binary search is not an alternative if we are interested in this type
of queries. For the case of patterns present in the text, binary searching
takes about 10 times more, and for patterns not present in the text about 10-1000
times more, than the time needed using tries.
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We have presented a new full-text self-index based on the Lempel-Ziv parsing. This 
index is especially well suited for applications in which the text is highly repetitive and 
the user is interested in finding patterns in the text (locate) and accessing arbitrary 
substrings of the text {extract). Our indexes provide a much better space/time trade- 
off than the previous ones for these operations. 

The compression ratio of our indexes is more than 10 times better than that of previous 
indexes based on LZ78, which are shown to be inappropriate for very repetitive texts. 
Additionally, the compression ratio of our smallest index is, for almost all texts (13 
out of 16), better than the lower bound achievable using RLCSA [SVMN08], the best 
previous self-index for these texts. When compared to the smallest practical RLCSA, 
the compression of our index is better for all texts except one, usually by a factor of 
at least 2. Compared to pure LZ77 compression, our index usually takes 3-6 times the 
space achieved by p7zip. 

We also introduced a new LZ parsing called LZ-End, which is close to LZ77 in 
compression ratio and gives faster access to text substrings. The extraction speed 
when using LZ-End is always better than that of RLCSA, and the extraction speed 
of our LZ77-based index is also superior for small substrings. Our indexes are always 
better for locating the occurrences of short patterns (of length up to 10), and the 
results are mixed for longer ones. This is because our locating time is quadratic in 
the pattern length and also depends on the extraction speed, which behaves differently 
depending on the text. 
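
As an illustration of the parsing rule, the following Python sketch builds an LZ-End parse naively. It assumes the usual formulation of LZ-End, in which the copied part of each phrase must end exactly where some earlier phrase ends, and every phrase carries one explicit trailing symbol. The function names, the triple layout and the brute-force search are assumptions made for this example and do not reflect our implementation.

```python
def lz_end_parse(text):
    """Naive LZ-End parse (illustrative only, roughly cubic time).

    Each phrase is (source_phrase, copy_length, next_char): the copied part
    is the last copy_length symbols of the text up to the end of phrase
    source_phrase; source_phrase is None when nothing is copied.
    """
    phrases = []
    ends = []  # ends[q] = text position just past phrase q
    i, n = 0, len(text)
    while i < n:
        best_len, best_q = 0, None
        for q, e in enumerate(ends):
            # The copy must fit before e and leave one symbol for next_char.
            max_len = min(e, n - 1 - i)
            # Only lengths that would improve on the current best are tried.
            for length in range(max_len, best_len, -1):
                if text[e - length:e] == text[i:i + length]:
                    best_len, best_q = length, q
                    break
        phrases.append((best_q, best_len, text[i + best_len]))
        i += best_len + 1
        ends.append(i)
    return phrases


def lz_end_decode(phrases):
    """Rebuild the text from the phrases produced by lz_end_parse."""
    out, ends = [], []
    for q, length, ch in phrases:
        if length:
            e = ends[q]
            out.extend(out[e - length:e])  # source lies entirely in the past
        out.append(ch)
        ends.append(len(out))
    return "".join(out)


if __name__ == "__main__":
    sample = "alabar_a_la_alabarda$"
    parsed = lz_end_parse(sample)
    print(len(parsed), lz_end_decode(parsed) == sample)
```

On the text alabar_a_la_alabarda$ this sketch yields the ten phrases a, l, ab, ar, _, a_, la, _a, labard, a$. Since every source ends at a phrase boundary, a decoder never has to start reading in the middle of an earlier copy, which is the property LZ-End exploits for extraction.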

Footnote: We could not find any characteristic that explains why the compression ratio 
of RLCSA for those texts is superior to that of our indexes. 

The only operation for which RLCSA is consistently better than our indexes is 
answering whether a pattern is present in the text (exists), the difference being even 
more pronounced for non-existent patterns. Similarly, our indexes cannot 
count the number of occurrences without locating them all, whereas RLCSA can 
do this very fast. Nevertheless, it has been argued [AN10] that these two queries are 
used in much more specific applications, serving as a basis for complex tasks such as 
approximate pattern matching or text categorization, while extracting and locating 
are the most important for general applications. 

An interesting goal for future research would be to reduce the dependence of the 
locate query time on the pattern length m from quadratic to linear. This improvement 
would make our index even more attractive. It has been achieved for other LZ-based 
indexes [AN07, RO08], yet these are built on LZ parsings that are too weak for very 
repetitive texts. 

Another line of research would be to design new LZ-like compression schemes 
allowing fast decompression of random substrings of the LZ parsing. Note that our 
only trade-off related to extraction speed is the use of LZ-End instead of LZ77, and 
even LZ-End takes constant time per extracted symbol only in certain cases. In a 
recent work, Kuruppu et al. [KPZ10] use a single document as the dictionary for 
the LZ77 algorithm, storing that source document in plain form. This method is 
a heuristic and works well only when the documents of the collection 
are not successive versions, as in collections of DNA. However, even the compression 
achieved for DNA collections (Para and Cere) is almost the same as the compression 
we achieved using our best LZ77 variant. We provide a self-index whereas they are only 
able to extract text, although their extraction times are more than 100 times faster 
than ours. Nevertheless, this method is orthogonal to our index proposal, and we 
could build our self-index on top of their compression scheme. 
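
The idea of parsing against a single plain reference document can be sketched as follows. This is only an illustration of the general scheme described above, not the implementation of Kuruppu et al.: the names `rlz_parse` and `rlz_extract`, the factor format and the brute-force longest-match search are assumptions made for the example, and a practical version would presumably search an index built over the reference instead.

```python
def rlz_parse(reference, doc):
    """Factor doc into copies from a plain reference (illustrative sketch).

    A factor is either (position, length), meaning
    reference[position:position + length], or a single literal character
    when the next symbol does not occur in the reference at all.
    """
    factors = []
    i = 0
    while i < len(doc):
        best_len, best_pos = 0, 0
        for pos in range(len(reference)):
            length = 0
            while (pos + length < len(reference)
                   and i + length < len(doc)
                   and reference[pos + length] == doc[i + length]):
                length += 1
            if length > best_len:
                best_len, best_pos = length, pos
        if best_len == 0:
            factors.append(doc[i])  # literal symbol
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def rlz_extract(reference, factors):
    """Rebuild the document; each copy is one direct slice of the reference."""
    parts = []
    for factor in factors:
        if isinstance(factor, str):
            parts.append(factor)
        else:
            pos, length = factor
            parts.append(reference[pos:pos + length])
    return "".join(parts)


if __name__ == "__main__":
    ref = "ACGTACGTTGCA"
    doc = "ACGTTGCAACGA"
    fs = rlz_parse(ref, doc)
    print(fs, rlz_extract(ref, fs) == doc)
```

Every copy is read straight from the plain reference, with no chain of earlier phrases to follow. The price is that the reference itself is stored uncompressed and that repetitions between the other documents are not captured.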

Another important line of research is to devise an LZ parsing algorithm that uses 
space proportional to that of the final compressed text. Currently, building the LZ77 
parsing requires about six times the space of the original text. Although this space 
is lower than that of RLCSA, it is still too much to handle very large text collections. 
Alternatively, a parsing algorithm working in secondary storage would also be useful 
to handle very large collections. We are aware that this is a very challenging 
task, as the parsing process is strongly related to dynamic self-indexes for repetitive 
texts. That is, if we have a dynamic self-index (or at least an index able to insert 
strings at the end) we can easily produce the LZ parsing of the text. Hence, studying 
how we can build a dynamic LZ-based index is a natural research direction. 
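
To make the link between parsing and indexing concrete, the following Python sketch computes a greedy LZ77 parse by brute force. It is a quadratic illustration only, not the construction used in this work; the function names and the (source, length, next_char) triple are assumptions made for the example. The inner search for the longest earlier occurrence is the kind of query that an index over the already parsed prefix could answer, which is why an index supporting insertions at the end would yield the parsing directly.

```python
def lz77_parse(text):
    """Greedy LZ77 parse by brute force (illustrative, quadratic or worse).

    Each phrase is (source, length, next_char): copy `length` symbols
    starting at position `source`, then append `next_char`. The copy may
    overlap the phrase being produced.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, 0
        for src in range(i):
            length = 0
            # Stop one symbol early so a trailing explicit symbol remains.
            while i + length < n - 1 and text[src + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the text; copying symbol by symbol handles overlapping sources."""
    out = []
    for src, length, ch in phrases:
        for k in range(length):
            out.append(out[src + k])
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    sample = "alabar_a_la_alabarda$"
    parsed = lz77_parse(sample)
    print(len(parsed), lz77_decode(parsed) == sample)
```

On the text alabar_a_la_alabarda$ this sketch produces nine phrases. It keeps the whole text in memory and scans every earlier position for every phrase, which is exactly the cost that a space-efficient or external-memory construction would need to avoid.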

It would also be interesting to study whether counting could be answered more 
efficiently, whether more meaningful operations like approximate pattern matching could 
be implemented, and whether some operations of the suffix tree could be simulated on the 
index. 

Another interesting research goal is to decrease the space factor, both in theory 
and in practice. Compared to a pure LZ77 compressor, the factor is 4 in theory 
and 3-6 in practice, as mentioned. Such a reduction has been achieved for Arroyuelo 
and Navarro's LZ-index, reducing the factor from 4 [Nav04] to (2 + ε) [ANS06] (see 
Section 2.14.3). This was possible because there was some redundancy between the 
components of the index. We could also reduce the factor by coding the bitmaps 
of the wavelet tree of depths in compressed form [RRR02] (see Section 2.5), since 
most depths are very low in practice and only some are high. However, the space 
improvement would be modest, since the wavelet tree of depths 
takes just 2-4% of the index size. 

We have also presented a text corpus oriented to repetitive texts. The main goal 
of this corpus is to become a reference set in experimentation with this kind of text. 
The corpus is publicly available at http://pizzachili.dcc.uchile.cl/repcorpus.html. 

Finally, our implementation is publicly available at 
http://pizzachili.dcc.uchile.cl/indexes/LZ77-index, to promote its use in real-world 
and research applications and to serve as a baseline for future developments in 
repetitive text indexing. 

Appendix A 
Experimental Results 



In this appendix we present the results of the experiments described in Section 
for the remaining texts. 

Figure A.1: F41 results (1). Note the logscales. 
Figure A.2: F41 results (2). Note the logscales. 
Figure A.3: R13 results (1). Note the logscales. 
Figure A.4: R13 results (2). Note the logscales. 
Figure A.6: Proteins 0.1%^ results (2). Note the logscales. 
Figure A.7: English 0.1%^ results (1). Note the logscales. 
Figure A.8: English 0.1%^ results (2). Note the logscales. 
Figure A.9: Sources 0.1%^ results (1). Note the logscales. 
Figure A.10: Sources 0.1%^ results (2). Note the logscales. 
Figure A.11: Para results (1). Note the logscales. 
Figure A.12: Para results (2). Note the logscales. 
Figure A.13: Cere results (1). Note the logscales. 
Figure A.14: Cere results (2). Note the logscales. 
Figure A.16: Influenza results (2). Note the logscales. 
Figure A.17: Escherichia Coli results (1). Note the logscales. 
Figure A.18: Escherichia Coli results (2). Note the logscales. 
Figure A.19: Coreutils results (1). Note the logscales. 
Figure A.20: Coreutils results (2). Note the logscales. 

[Plots for Einstein (en). Panels: Extraction Speed; Locate Time; Exist Time for 
Patterns Found; Exist Time for Patterns not Found. Axis labels: log(Snippet Length), 
Pattern Length, log(Pattern Length/5). Legend: four RLCSA variants (including RLCSA512 
and RLCSA256), two LZ77 variants and two LZ-End variants.] 

Figure A.21: Einstein (en) results (1). Note the logscales. 
Figure A.22: Einstein (en) results (2). Note the logscales. 
Figure A.23: Einstein (de) results (1). Note the logscales. 



[Plots for Einstein (de), each against Compression Ratio. Panels: Locate Time for 
|P|=2, |P|=4, |P|=10, |P|=15 and |P|=20; Exist Time for Patterns Found (|P|=20); 
Exist Time for Patterns not Found (|P|=20). Series: RLCSA, LZ77, LZ-End.] 

Figure A.24: Einstein (de) results (2). Note the logscales. 
Figure A.25: World Leaders results (1). Note the logscales. 
Figure A.26: World Leaders results (2). Note the logscales. 



